Abstract

We study the problem of navigating through a database of similar objects using comparisons.
This problem is known to be strongly related to the small-world network design problem.
However, contrary to prior work, which focuses on cases where objects in the database are
equally popular, we consider here the case where the demand for objects may be heterogeneous.
We show that, under heterogeneous demand, the small-world network design problem is NP-hard.
Given the above negative result, we propose a novel mechanism for small-world design and
provide an upper bound on its performance under heterogeneous demand. The above mechanism
has a natural equivalent in the context of content search through comparisons, and we
establish both an upper bound and a lower bound for the performance of this mechanism.
These bounds are intuitively appealing, as they depend on the entropy of the demand as well
as its doubling constant, a quantity capturing the topology of the set of target objects.
Finally, based on these results, we propose an adaptive learning algorithm for content
search that meets the performance guarantees achieved by the above mechanisms.

1. Introduction 

The problem we study in this paper is content search through comparisons. In short, a user 
searching for a target object navigates through a database in the following manner. The 
user is asked to select the object most similar to her target from a small list of objects. A
new object list is then presented to the user based on her earlier selection. This process 
is repeated until the target is included in the list presented, at which point the search 
terminates. 



Searching through comparisons is a typical example of exploratory search (White and Roth,
2009), the need for which arises when users are unable to state and submit explicit queries
to the database. Exploratory search has several important real-life applications. An
often-cited example is navigating through a database of pictures of humans in which subjects
are photographed under diverse uncontrolled conditions (see Tschopp and Diggavi, 2009, 2010).
For example, the pictures may be taken outdoors, from different angles or distances, while 
the subjects assume different poses, are partially obscured, etc. Automated methods may 
fail to extract meaningful features from such photos, so the database cannot be queried in 
the traditional fashion. On the other hand, a human searching for a particular person can 
easily select from a list of pictures the subject most similar to the person she has in mind. 

Users may also be unable to state queries because, e.g., they are unfamiliar with the
search domain, or do not have a clear target in mind. For example, a novice classical music 
listener may not be able to express that she is, e.g., looking for a fugue or a sonata. She 
might however identify among samples of different musical pieces the closest to the one she 
has in mind. Alternatively, a user surfing the web may not know a priori which post she 
wishes to read; presenting a list of blog posts and letting the surfer identify which one she 
likes best can steer her in the right direction. 

In all the above applications, the problem of content search through comparisons amounts 
to determining which objects to present to the user in order to find the target object as 
quickly as possible. Formally, the behavior of a human user can be modeled by a so-called
comparison oracle introduced by Goyal et al. (2008): given a target and a choice between
two objects, the oracle outputs the one closest to the target. The goal is thus to find a
sequence of proposed pairs of objects that leads to the target object with as few oracle
queries as possible. This problem was introduced by Goyal et al. (2008) and has recently
received considerable attention (see, for example, Lifshits and Zhang, 2009; Tschopp and
Diggavi, 2009, 2010).

Content search through comparisons is also naturally related to the following problem: 
given a graph embedded in a metric space, how should one augment this graph by adding 
edges in order to minimize the expected cost of greedy forwarding over this graph? This
is known as the small-world network design problem (see, for example, Fraigniaud et al.,
2006; Fraigniaud and Giakkoupis, 2010) and has a variety of applications, e.g., in network
routing. In this paper, we consider both problems under the scenario of heterogeneous 
demand. This is very interesting in practice: objects in a database are indeed unlikely to 
be requested with the same frequency. Our contributions are as follows: 



• We show that the small-world network design problem under general heterogeneous
demand is NP-hard. Given earlier work on this problem under homogeneous demand
(Fraigniaud and Giakkoupis, 2010; Fraigniaud et al., 2006), this result is interesting
in its own right.

• We propose a novel mechanism for edge addition in the small-world design problem,
and provide an upper bound on its performance.

• The above mechanism has a natural equivalent in the context of content search through
comparisons, and we provide a matching upper bound for the performance of this
mechanism.

• Finally, we also establish a lower bound on any mechanism solving the content search
through comparisons problem.

• Based on these results, we propose an adaptive learning algorithm for content search
that, given access only to a comparison oracle, can meet the performance guarantees
achieved by the above mechanisms.

To the best of our knowledge, we are the first to study the above two problems in a 
setting of heterogeneous demand. Our analysis is intuitively appealing because our upper 
and lower bounds relate the cost of content search to two important properties of the demand 
distribution, namely its entropy and its doubling constant. We thus provide performance 
guarantees in terms of the bias of the distribution of targets, captured by the entropy, as 
well as the topology of their embedding, captured by the doubling constant. 

The remainder of this paper is organized as follows. In Section 2 we provide an overview
of the related work in this area. In Sections 3 and 4 we introduce our notation and formally
state the two problems that are the focus of this work, namely content search through
comparisons and small-world network design. We present our main results in Section 5 and
our adaptive learning algorithm in Section 6. Section 7 is devoted to the proofs of our main
theorems. We then address two extensions of our work in Section 8 and finally conclude
in Section 9.



2. Related Work 

Content search through comparisons is a special case of nearest neighbour search (NNS),
a problem that has been extensively studied (see Clarkson, 2006; Indyk and Motwani,
1998). Our work can be seen as an extension of earlier work (Karger and Ruhl, 2002;
Krauthgamer and Lee, 2004; Clarkson, 2006) considering the NNS problem for objects
embedded in a metric space with a small intrinsic dimension. In particular, Krauthgamer and
Lee (2004) introduce navigating nets, a deterministic data structure for supporting NNS in
doubling metric spaces. A similar technique was considered by Clarkson (2006) for objects
embedded in a space satisfying a certain sphere-packing property, while Karger and Ruhl
(2002) relied on growth-restricted metrics; all of the above assumptions have connections
to the doubling constant we consider in this paper. In all of these works, however, the 
underlying metric space is fully observable by the search mechanism while, in our work, we 
are restricted to accesses to a comparison oracle. Moreover, in all of the above works the 
demand over the target objects is assumed to be homogeneous. 

NNS with access to a comparison oracle was first introduced by Goyal et al. (2008), and
further explored by Lifshits and Zhang (2009) and Tschopp and Diggavi (2009, 2010). A
considerable advantage of the above works is that the assumption that objects are a priori
embedded in a metric space is removed; rather than requiring that similarity between
objects is captured by a distance metric, the above works only assume that any two objects
can be ranked in terms of their similarity to any target by the comparison oracle. To
provide performance guarantees on the search cost, Goyal et al. (2008) introduced a so-called
"disorder-constant", capturing the degree to which object rankings violate the triangle in- 
equality. This disorder-constant plays roughly the same role in their analysis as the doubling 
constant does in ours. Nevertheless, these works also assume homogeneous demand, so our 
work can be seen as an extension of searching with comparisons to heterogeneity, with the 
caveat of restricting our analysis to the case where a metric embedding exists. 
An additional important distinction between Goyal et al. (2008), Lifshits and Zhang
(2009), Tschopp and Diggavi (2009, 2010) and our work is the existence of a learning phase,
during which explicit questions are posed to the comparison oracle. A data structure is
constructed during this phase, which is subsequently used to answer queries submitted to 
the database during a "search" phase. The above works establish different tradeoffs between 
the length of the learning phase, the space complexity of the data structure created, and the 
cost incurred during searching. In contrast, the learning scheme we consider in Section [6] is 
adaptive, and learning occurs while users search; the drawback lies in that our guarantees 
on the search cost are asymptotic. Again, the main advantage of our approach lies in dealing 
with heterogeneity. 



The use of interactive methods (i.e., methods that incorporate human feedback) for content
search has a long history in the literature. Arguably, the first oracle considered to model
such methods is the so-called membership oracle (see Garey, 1972), which allows the search
mechanism to ask a user questions of the form "does the target belong to set A" (see also
our discussion in Section 3.4). Branson et al. (2010) deploy such an interactive method
for object classification and evaluate it on the Animals with Attributes database. A similar
approach was used by Geman and Jedynak (1993), who formulated shape recognition as a
coding problem and applied this approach to handwritten numerals and satellite images.
Having access to a membership oracle however is a strong assumption, as humans may not 
necessarily be able to answer queries of the above type for any object set A. Moreover, 
the large number of possible sets makes the cost of designing optimal querying strategies 
over large datasets prohibitive. In contrast, the comparison oracle model makes a far 
weaker assumption on human behavior — namely, the ability to compare different objects 
to the target — and significantly limits the design space, making search mechanisms using 
comparisons practical even over large datasets. 



The design of small-world networks (also called navigable networks) has received a
lot of attention after the seminal work of Kleinberg (2000). Our work is most similar
to Fraigniaud et al. (2006), where a condition under which graphs embedded in a doubling
metric space can be made navigable is identified. The same idea was explored in more
general spaces by Fraigniaud and Giakkoupis (2010). Again, the main difference in our
approach to small world network design lies in considering heterogeneous demand, an aspect 
of small-world networks not investigated in earlier work. 



The relationship between small-world network design and content search has also been
observed in earlier work (see Goyal et al., 2008) and was exploited by Lifshits and Zhang
(2009) in proposing their data structures for content search through comparisons; we further
expand on this issue in Section 4.3, as this is an approach we also follow.



3. Definitions and Notation 



In this section we introduce some definitions and notation which will be used throughout 
this paper. 
3.1 Objects and Metric Embedding

Consider a set of objects $\mathcal{N}$, where $|\mathcal{N}| = n$. We assume that there
exists a metric space $(\mathcal{M}, d)$, where $d(x,y)$ denotes the distance between
$x, y \in \mathcal{M}$, such that objects in $\mathcal{N}$ are embedded in $(\mathcal{M}, d)$:
i.e., there exists a one-to-one mapping from $\mathcal{N}$ to a subset of $\mathcal{M}$.

The objects in $\mathcal{N}$ may represent, for example, pictures in a database. The metric
embedding can be thought of as a mapping of the database entries to a set of features
(e.g., the age of the person depicted, her hair and eye color, etc.). The distance between two
objects would then capture how "similar" two objects are w.r.t. these features. In what
follows, we will abuse notation and write $\mathcal{N} \subseteq \mathcal{M}$, keeping in
mind that there might be a difference between the physical objects (the pictures) and their
embedding (the attributes that characterize them).

Given an object $z \in \mathcal{N}$, we can order objects according to their distance from
$z$. We will write $x \preceq_z y$ if $d(x,z) \le d(y,z)$. Moreover, we will write
$x \sim_z y$ if $d(x,z) = d(y,z)$ and $x \prec_z y$ if $x \preceq_z y$ but not $x \sim_z y$.
Note that $\sim_z$ is an equivalence relation, and hence partitions $\mathcal{N}$ into
equivalence classes. Moreover, $\preceq_z$ defines a total order over these equivalence
classes, with respect to their distance from $z$. Given a non-empty set
$A \subseteq \mathcal{N}$, we denote by $\min_{\preceq_z} A$ the object in $A$ closest to
$z$, i.e., $\min_{\preceq_z} A = w \in A$ s.t. $w \preceq_z v$ for all $v \in A$.

3.2 Comparison Oracle 



A comparison oracle (Goyal et al., 2008) is an oracle that, given two objects $x$, $y$ and a
target $t$, returns the closest object to $t$. More formally,

$$\text{Oracle}(x, y, t) = \begin{cases} x & \text{if } x \prec_t y, \\ y & \text{if } x \succ_t y, \\ x \text{ or } y & \text{if } x \sim_t y. \end{cases} \tag{1}$$

Observe that if $x = \text{Oracle}(x, y, t)$ then $x \preceq_t y$; this does not necessarily
imply, however, that $x \prec_t y$.

This oracle basically aims to capture the behavior of human users. A human interested 
in locating, e.g., a target picture t within the database, may be able to compare other 
pictures with respect to their similarity to this target but cannot associate a numerical 
value to this similarity. Moreover, when the pair of pictures compared are equally similar 
to the target, the decision made by the human may be arbitrary. 

It is important to note here that although we write Oracle(x, y, t) to stress that a query 
always takes place with respect to some target t, in practice the target is hidden and only 
known by the oracle. Alternatively, following the "oracle as human" analogy, the human 
user has a target in mind and uses it to compare the two objects, but never discloses it until 
actually being presented with it. 

Note that our oracle is weaker than one that correctly identifies the relationship
$x \sim_t y$ and, e.g., returns a special character "=" once two such objects are proposed:
to see this, observe that oracle (1) can be implemented by using this stronger oracle.
Hence, all our results hold if we are provided with such an oracle instead.
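
The oracle of Section 3.2 can be simulated when an embedding is available. The following is a minimal sketch, not part of the original paper: the helper `make_oracle`, the `distance` argument and the random tie-breaking rule are illustrative assumptions, since the paper only requires that the oracle exists and that ties may be resolved arbitrarily.

```python
import random


def make_oracle(distance, rng=None):
    """Build a comparison oracle Oracle(x, y, t) from a distance function.

    `distance` and `rng` are hypothetical helpers used only for simulation;
    the search mechanism itself never sees distances or the target.
    """
    rng = rng or random.Random(0)

    def oracle(x, y, t):
        dx, dy = distance(x, t), distance(y, t)
        if dx < dy:                  # x is strictly closer to t
            return x
        if dy < dx:                  # y is strictly closer to t
            return y
        return rng.choice([x, y])    # tie: either answer is allowed

    return oracle


# Objects are points on the real line; the target is passed explicitly
# only because we are simulating the user who holds it in mind.
oracle = make_oracle(lambda a, b: abs(a - b))
print(oracle(2, 9, 8))   # 9 is closer to the target 8 than 2 is
```

Only the returned object is exposed to the caller, which mirrors the assumption that a user can rank two objects but cannot attach a numerical value to their similarity.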
3.3 Demand

We denote by $\mathcal{N} \times \mathcal{N}$ the set of all ordered pairs of objects in
$\mathcal{N}$. For $(s,t) \in \mathcal{N} \times \mathcal{N}$, we will call $s$ the source
and $t$ the target of the ordered pair. We will consider a probability distribution
$\lambda$ over all ordered pairs of objects in $\mathcal{N}$, which we will call the demand.
In other words, $\lambda$ will be a non-negative function such that
$\sum_{(s,t) \in \mathcal{N} \times \mathcal{N}} \lambda(s,t) = 1$. In general, the
demand can be heterogeneous, as $\lambda(s,t)$ may vary across different sources and targets.
We refer to the marginal distributions

$$\nu(s) = \sum_{t} \lambda(s,t), \qquad \mu(t) = \sum_{s} \lambda(s,t),$$

as the source and target distributions, respectively. Moreover, we will refer to the support
of the target distribution

$$\mathcal{T} = \operatorname{supp}(\mu) = \{x \in \mathcal{N} : \mu(x) > 0\}$$

as the target set of the demand.
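
As a small illustration of these definitions, the sketch below computes the source distribution, the target distribution and the target set from a demand given as a dictionary. The function name `marginals` and the toy demand are assumptions made for this example only.

```python
from collections import defaultdict


def marginals(demand):
    """Return (nu, mu, target_set) for a demand {(s, t): probability}."""
    nu, mu = defaultdict(float), defaultdict(float)
    for (s, t), p in demand.items():
        nu[s] += p    # nu(s) = sum over t of lambda(s, t)
        mu[t] += p    # mu(t) = sum over s of lambda(s, t)
    target_set = {t for t, p in mu.items() if p > 0}
    return dict(nu), dict(mu), target_set


# Hypothetical heterogeneous demand over three objects.
demand = {("a", "b"): 0.5, ("a", "c"): 0.25, ("b", "c"): 0.25}
nu, mu, T = marginals(demand)
print(nu, mu, T)
```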

As we will see in Section 5, the target distribution $\mu$ will play an important role in our
analysis. In particular, two quantities that affect the performance of searching in our scheme 
will be the entropy and the doubling constant of the target distribution. We introduce these 
two notions formally below. 

3.4 Entropy 

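Let $\sigma$ be a probability distribution over $\mathcal{N}$. The entropy of $\sigma$ is
defined as

$$H(\sigma) = \sum_{x \in \operatorname{supp}(\sigma)} \sigma(x) \log \frac{1}{\sigma(x)}. \tag{2}$$

We define the max-entropy of $\sigma$ as

$$H_{\max}(\sigma) = \max_{x \in \operatorname{supp}(\sigma)} \log \frac{1}{\sigma(x)}. \tag{3}$$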

The entropy has strong connections with the content search problem. More specifically,
suppose that we have access to a so-called membership oracle (Cover and Thomas, 1991)
that can answer queries of the following form:

"Given a target $t$ and a subset $A \subseteq \mathcal{N}$, does $t$ belong to $A$?"

Assume now that an object $t$ is selected according to a distribution $\mu$. It is well known
that to find a target $t$ one needs to submit at least $H(\mu)$ queries, on average, to the
oracle described above (see Cover and Thomas, 1991, chap. 2). Moreover, there exists an
algorithm (Huffman coding) that finds the target with only $H(\mu) + 1$ queries on average
(Cover and Thomas, 1991). In the worst case, which occurs when the target is the least
frequently selected object, the algorithm requires $H_{\max}(\mu) + 1$ queries to identify $t$.
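
The membership-oracle bounds can be checked numerically on a toy distribution. The sketch below is an illustration under stated assumptions: logarithms are taken base 2, the function names are hypothetical, and the depth of an object in the Huffman tree is interpreted as the number of yes/no membership questions needed to identify it.

```python
import heapq
import math


def entropy(p):
    """H(p) in bits, following equation (2) with base-2 logarithms."""
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)


def max_entropy(p):
    """H_max(p) in bits, following equation (3)."""
    return max(math.log2(1 / q) for q in p.values() if q > 0)


def huffman_depths(p):
    """Depth of each object in a Huffman tree built for p."""
    heap = [(q, i, [x]) for i, (x, q) in enumerate(p.items()) if q > 0]
    heapq.heapify(heap)
    depth = {xs[0]: 0 for _, _, xs in heap}
    counter = len(heap)              # unique tie-breaker for heap entries
    while len(heap) > 1:
        q1, _, xs1 = heapq.heappop(heap)
        q2, _, xs2 = heapq.heappop(heap)
        for x in xs1 + xs2:          # merged subtrees move one level down
            depth[x] += 1
        heapq.heappush(heap, (q1 + q2, counter, xs1 + xs2))
        counter += 1
    return depth


mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # toy target distribution
depths = huffman_depths(mu)
expected_queries = sum(mu[x] * depths[x] for x in mu)
print(entropy(mu), expected_queries, max_entropy(mu))
```

For this dyadic distribution the expected number of questions coincides with the entropy; in general it lies between $H(\mu)$ and $H(\mu) + 1$.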

Our work identifies similar bounds assuming that one only has access to a comparison
oracle, like the one described by (1). Not surprisingly, the entropy of the target distribution
$H(\mu)$ shows up in the performance bounds that we obtain (Theorems 3 and 4). However,
searching for an object will depend not only on the entropy of the target distribution, but
also on the topology of the target set $\mathcal{T}$. This will be captured by the doubling
constant of $\mu$, which we describe in more detail below.
Figure 1: Example of dependence of $c(\sigma)$ on the topology of the support
$\operatorname{supp}(\sigma)$. When $\operatorname{supp}(\sigma)$ consists of $n = 64$
objects arranged in a cube, $c(\sigma) = 2^3$. If, on the other hand, these $n$ objects
are placed on a plane, $c(\sigma) = 2^2$. In both cases $\sigma$ is assumed to be
uniform, and $H(\sigma) = \log n$.



3.5 Doubling Constant 

Given an object $x \in \mathcal{N}$, we denote by

$$B_x(r) = \{y \in \mathcal{M} : d(x,y) \le r\} \tag{4}$$

the closed ball of radius $r \ge 0$ around $x$. Given a probability distribution $\sigma$
over $\mathcal{N}$ and a set $A \subseteq \mathcal{N}$, let
$\sigma(A) = \sum_{x \in A} \sigma(x)$. We define the doubling constant $c(\sigma)$ of a
distribution $\sigma$ to be the minimum $c > 0$ for which

$$\sigma(B_x(2r)) \le c \cdot \sigma(B_x(r)), \tag{5}$$

for any $x \in \operatorname{supp}(\sigma)$ and any $r > 0$. Moreover, we will say that
$\sigma$ is $c$-doubling if $c(\sigma) = c$.

Note that, contrary to the entropy $H(\sigma)$, the doubling constant $c(\sigma)$ depends on
the topology of $\operatorname{supp}(\sigma)$, determined by the embedding of $\mathcal{N}$
in the metric space $(\mathcal{M}, d)$. This is illustrated in Fig. 1. In this example,
$|\mathcal{N}| = 64$, and the set $\mathcal{N}$ is embedded in a 3-dimensional cube. Assume
that $\sigma$ is the uniform distribution over the $n$ objects; if these objects are arranged
uniformly in a cube, then $c(\sigma) = 2^3$; if however these $n$ objects are arranged
uniformly in a 2-dimensional plane, $c(\sigma) = 2^2$. Note that, in contrast, the entropy
of $\sigma$ in both cases equals $\log n$ (and so does the max-entropy).
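
For a finite support, the doubling constant in (5) can be computed by brute force, because the ratio $\sigma(B_x(2r)) / \sigma(B_x(r))$ only changes at radii equal to a pairwise distance or to half of one. The sketch below is an illustration, not the authors' code: the function names, the Manhattan metric and the grid sizes are assumptions. On small finite grids, boundary and discretization effects mean the value it reports need not coincide with the idealized $2^d$ scaling of Fig. 1.

```python
import itertools


def doubling_constant(points, prob, dist):
    """Brute-force c(sigma) for a distribution with finite support."""
    support = [x for x in points if prob[x] > 0]

    def mass(x, r):
        # sigma(B_x(r)) for the closed ball of radius r around x
        return sum(prob[y] for y in support if dist(x, y) <= r)

    c = 1.0
    for x in support:
        distances = {dist(x, y) for y in support}
        # The ratio is piecewise constant in r, so these radii suffice.
        radii = distances | {d / 2 for d in distances}
        for r in radii:
            c = max(c, mass(x, 2 * r) / mass(x, r))
    return c


def grid(side, dim):
    return list(itertools.product(range(side), repeat=dim))


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


for side, dim in [(8, 2), (4, 3)]:    # 64 objects on a plane / in a cube
    pts = grid(side, dim)
    uniform = {p: 1 / len(pts) for p in pts}
    print(dim, doubling_constant(pts, uniform, manhattan))
```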

4. Problem Statement 

We now formally define the two problems that will be the main focus of this paper. The 
first is the problem of content search through comparisons and the second is the small-world 
network design problem. 



4.1 Content Search Through Comparisons 

For the content search problem, we consider the object set $\mathcal{N}$, embedded in
$(\mathcal{M}, d)$. Although this embedding exists, we are constrained by not being able to
directly compute object distances. Instead, we only have access to a comparison oracle, like
the one defined in Section 3.2.

Table 1: Summary of Notation

  Symbol                 Meaning
  $\mathcal{N}$          Set of objects
  $(\mathcal{M}, d)$     Metric space
  $d(x,y)$               Distance between $x, y \in \mathcal{M}$
  $x \preceq_z y$        Ordering w.r.t. distance from $z$
  $x \sim_z y$           $x$ and $y$ at same distance from $z$
  $\lambda$              The demand distribution
  $\nu$                  The source distribution
  $\mu$                  The target distribution
  $\mathcal{T}$          The target set
  $H(\sigma)$            The entropy of $\sigma$
  $H_{\max}(\sigma)$     The max-entropy of $\sigma$
  $B_x(r)$               The ball of radius $r$ centered at $x$
  $c(\sigma)$            The doubling constant of $\sigma$

Given access to the above oracle, we would like to navigate through $\mathcal{N}$ until we
find a target object. In particular, we define greedy content search as follows. Let $t$ be
the target object and $s$ some object that serves as a starting point. The greedy content
search algorithm proposes an object $w$ and asks the oracle to select, between $s$ and $w$,
the object closest to the target $t$, i.e., it invokes Oracle$(s, w, t)$. This process is
repeated until the oracle returns something other than $s$, i.e., the proposed object is
"more similar" to the target $t$. Once this happens, say at the proposal of some $w'$, if
$w' \neq t$, the greedy content search repeats the same process now from $w'$. If at any
point the proposed object is $t$, the process terminates.

Recall that in the "oracle as a human" analogy the human cannot reveal t before actually 
being presented with it. We similarly assume here that t is never "revealed" before actually 
being presented to the oracle. Though we write Oracle(x, y, t) to stress that the submitted 
query is w.r.t. proximity to $t$, the target $t$ is not a priori known. In particular, as we see
below, the decision of which objects x and y to present to the oracle cannot directly depend 
on t. 

More formally, let $x_k, y_k$ be the $k$-th pair of objects submitted to the oracle: $x_k$ is
the current object, which greedy content search is trying to improve upon, and $y_k$ is the
proposed object, submitted to the oracle for comparison with $x_k$. Let

$$o_k = \text{Oracle}(x_k, y_k, t) \in \{x_k, y_k\}$$

be the oracle's response, and define

$$\mathcal{H}_k = \{(x_i, y_i, o_i)\}_{i=1}^{k}, \qquad k = 1, 2, \ldots$$

to be the sequence of the first $k$ inputs given to the oracle, as well as the responses
obtained; $\mathcal{H}_k$ is the "history" of the content search up to and including the
$k$-th access to the oracle.
The source object is always one of the first two objects submitted to the oracle, i.e.,
$x_1 = s$. Moreover, in greedy content search,

$$x_{k+1} = o_k, \qquad k = 1, 2, \ldots$$

i.e., the current object is always the closest to the target among the ones submitted so far.
On the other hand, the selection of the proposed object $y_{k+1}$ will be determined by the
history $\mathcal{H}_k$ and the object $x_k$. In particular, given $\mathcal{H}_k$ and the
current object $x_k$ there exists a mapping
$(\mathcal{H}_k, x_k) \mapsto \mathcal{F}(\mathcal{H}_k, x_k) \in \mathcal{N}$ such that

$$y_{k+1} = \mathcal{F}(\mathcal{H}_k, x_k), \qquad k = 0, 1, \ldots,$$

where here we take $x_0 = s \in \mathcal{N}$ (the source/starting object) and
$\mathcal{H}_0 = \emptyset$ (i.e., before any comparison takes place, there is no history).

We will call the mapping $\mathcal{F}$ the selection policy of the greedy content search. In
general, we will allow the selection policy to be randomized; in this case, the object
returned by $\mathcal{F}(\mathcal{H}_k, x_k)$ will be a random variable, whose distribution

$$\Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w), \qquad w \in \mathcal{N}, \tag{6}$$

is fully determined by $(\mathcal{H}_k, x_k)$. Observe that $\mathcal{F}$ depends on the
target $t$ only indirectly, through $\mathcal{H}_k$ and $x_k$; this is consistent with our
assumption that $t$ is only "revealed" when it is eventually located.

We will say that a selection policy is memoryless if it depends on $x_k$ but not on the
history $\mathcal{H}_k$. In other words, the distribution (6) is the same when
$x_k = x \in \mathcal{N}$, irrespective of the comparisons performed prior to reaching $x_k$.

Our goal is to select $\mathcal{F}$ so that we minimize the number of accesses to the oracle.
In particular, given a source object $s$, a target $t$ and a selection policy $\mathcal{F}$,
we define the search cost

$$C_{\mathcal{F}}(s,t) = \inf\{k : y_k = t\}$$

to be the number of proposals to the oracle until $t$ is found. This is a random variable, as
$\mathcal{F}$ is randomized; let $\mathbb{E}[C_{\mathcal{F}}(s,t)]$ be its expectation. The
Content Search Through Comparisons problem is then defined as follows:

Content Search Through Comparisons (CSTC): Given an embedding
of $\mathcal{N}$ into $(\mathcal{M}, d)$ and a demand distribution $\lambda(s,t)$, select
$\mathcal{F}$ that minimizes the expected search cost

$$\bar{C}_{\mathcal{F}} = \sum_{(s,t) \in \mathcal{N} \times \mathcal{N}} \lambda(s,t)\, \mathbb{E}[C_{\mathcal{F}}(s,t)].$$

Note that, as $\mathcal{F}$ is randomized, the free variable in the above optimization
problem is the distribution (6).
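
The greedy content search loop and its cost can be simulated directly. The following sketch is illustrative only: `greedy_content_search`, the line-shaped object set and the uniform memoryless policy are assumptions chosen for the example and are not the selection policy analyzed in the paper. The policy receives the history and the current object but never the target, which is used solely to simulate the oracle and the stopping condition.

```python
import random


def greedy_content_search(source, target, oracle, policy, max_proposals=10_000):
    """Return the search cost: the index k of the first proposal y_k = target."""
    current, history = source, []
    for k in range(1, max_proposals + 1):
        proposal = policy(history, current)           # y_k
        if proposal == target:                        # target presented: stop
            return k
        answer = oracle(current, proposal, target)    # o_k
        history.append((current, proposal, answer))
        current = answer                              # x_{k+1} = o_k
    raise RuntimeError("target not found within max_proposals")


# Hypothetical setup: 32 objects on a line, ties resolved in favour of x.
objects = list(range(32))
rng = random.Random(1)
oracle = lambda x, y, t: x if abs(x - t) <= abs(y - t) else y
uniform_policy = lambda history, current: rng.choice(objects)   # memoryless

costs = [greedy_content_search(0, 20, oracle, uniform_policy) for _ in range(1000)]
print(sum(costs) / len(costs))   # empirical mean search cost for this policy
```

Averaging such costs over source/target pairs drawn from a demand $\lambda$ gives an empirical estimate of the objective in CSTC for the chosen policy.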
4.2 Small-World Network Design

In the small-world network design problem, we again consider the objects in $\mathcal{N}$,
embedded in $(\mathcal{M}, d)$. It is now assumed however that the objects in $\mathcal{N}$
are connected to each other. The network formed by such connections is represented by a
directed graph $G(\mathcal{N}, \mathcal{L} \cup \mathcal{S})$, where $\mathcal{L}$ is the set
of local edges and $\mathcal{S}$ is the set of shortcut edges. These edge sets are disjoint,
i.e., $\mathcal{L} \cap \mathcal{S} = \emptyset$.

The edges in $\mathcal{L}$ are typically assumed to satisfy the following property:

Property 1 For every pair of distinct objects $x, t \in \mathcal{N}$ there exists an object
$u$ adjacent to $x$ such that $(x,u) \in \mathcal{L}$ and $u \prec_t x$.

In other words, for any object $x$ and a target $t$, $x$ has a local edge leading to an object
closer to $t$.

Recall that in the content search problem the goal was to find $t$ (starting from source $s$)
using only accesses to a comparison oracle. Here the goal is to use such an oracle to route
a message from $s$ to $t$ over the links in graph $G$. In particular, given graph $G$, we
define greedy forwarding (Kleinberg, 2000) over $G$ as follows. Let $\Gamma(s)$ be the
neighborhood of $s$, i.e.,
$\Gamma(s) = \{u \in \mathcal{N} \text{ s.t. } (s,u) \in \mathcal{L} \cup \mathcal{S}\}$.
Given a source $s$ and a target $t$, greedy forwarding sends a message to the neighbor $w$
of $s$ that is as close to $t$ as possible, i.e.,

$$w = \min_{\preceq_t} \Gamma(s). \tag{7}$$

If $w \neq t$, the above process is repeated at $w$; if $w = t$, greedy forwarding terminates.

Note that local edges, through Property 1, guarantee that greedy forwarding from any
source $s$ will eventually reach $t$: there will always be a neighbor that is closer to $t$
than the object currently having the message. Moreover, the closest neighbor $w$ selected
through (7) can be found using a comparison oracle. In particular, if the message is at an
object $x$, $|\Gamma(x)|$ queries to the oracle will suffice to find the neighbor that is
closest to the target.
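
Greedy forwarding as in (7) can be sketched as follows. This is a minimal illustration under stated assumptions: the adjacency-dictionary representation, the function name and the path-graph example are hypothetical, and the closest neighbor is found with the distance directly rather than through oracle queries. The local edges are assumed to satisfy Property 1; without it the loop is not guaranteed to terminate.

```python
def greedy_forwarding(source, target, local, shortcuts, dist):
    """Return the list of nodes visited by greedy forwarding.

    `local` and `shortcuts` map each node to a set of out-neighbours.
    """
    path, current = [source], source
    while current != target:
        neighbours = local.get(current, set()) | shortcuts.get(current, set())
        # Neighbour closest to the target, as in equation (7); a comparison
        # oracle could find it by scanning the neighbours one at a time.
        current = min(neighbours, key=lambda u: dist(u, target))
        path.append(current)
    return path


# Ten objects on a line; each node has local edges to its adjacent nodes.
n = 10
local = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
dist = lambda a, b: abs(a - b)

print(greedy_forwarding(0, 9, local, {}, dist))          # local edges only
print(greedy_forwarding(0, 9, local, {0: {6}}, dist))    # one shortcut 0 -> 6
```

The number of message hops is the length of the returned path minus one, so the example shows how a single well-placed shortcut shortens the route.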

The edges in $\mathcal{L}$ are typically called "local" because they are usually determined
by object proximity. For example, in the classical paper by Kleinberg (2000), objects are
arranged uniformly in a rectangular $k$-dimensional grid (with no gaps) and $d$ is taken to
be the Manhattan distance on the grid. Moreover, there exists an $r > 1$ such that any two
objects at distance less than $r$ have an edge in $\mathcal{L}$. In other words,

$$\mathcal{L} = \{(x,y) \in \mathcal{N} \times \mathcal{N} \text{ s.t. } d(x,y) < r\}. \tag{8}$$

Assuming every position in the rectangular grid is occupied, such edges indeed satisfy
Property 1. In this work, we will not require that edges in $\mathcal{L}$ are given by (8)
or some other locality-based definition; our only assumption is that they satisfy
Property 1. Nevertheless, for the sake of consistency with prior work, we also refer to
edges in $\mathcal{L}$ as "local".

The shortcut edges $\mathcal{S}$ need not satisfy Property 1; our goal is to select these
shortcut edges in a way so that greedy forwarding is as efficient as possible.

In particular, we assume that we can select no more than $\beta$ shortcut edges, where
$\beta$ is a positive integer. For $S$ a subset of $\mathcal{N} \times \mathcal{N}$ such that
$|S| \le \beta$, we denote by $C_S(s,t)$ the cost of greedy forwarding, in message hops, for
forwarding a message from $s$ to $t$ given that $\mathcal{S} = S$. We allow the selection of
shortcut edges to be random: the set $\mathcal{S}$ can be a random variable over all subsets
$S$ of $\mathcal{N} \times \mathcal{N}$ such that $|S| \le \beta$.

We denote by

$$\Pr(\mathcal{S} = S), \qquad S \subseteq \mathcal{N} \times \mathcal{N} \text{ s.t. } |S| \le \beta \tag{9}$$

the distribution of $\mathcal{S}$. Given a source $s$ and a target $t$, let

$$\mathbb{E}[C_{\mathcal{S}}(s,t)] = \sum_{S \subseteq \mathcal{N} \times \mathcal{N} : |S| \le \beta} C_S(s,t) \cdot \Pr(\mathcal{S} = S)$$

be the expected cost of forwarding a message from $s$ to $t$ with greedy forwarding, in
message hops.

We consider again a heterogeneous demand: a source and target object are selected at
random from $\mathcal{N} \times \mathcal{N}$ according to a demand probability distribution
$\lambda$. The small-world network design problem can then be formulated as follows.

Small-World Network Design (SWND): Given an embedding of $\mathcal{N}$ into
$(\mathcal{M}, d)$, a set of local edges $\mathcal{L}$, a demand distribution $\lambda$, and
an integer $\beta > 0$, select a r.v. $\mathcal{S} \subseteq \mathcal{N} \times \mathcal{N}$
that minimizes

$$\bar{C}_{\mathcal{S}} = \sum_{(s,t) \in \mathcal{N} \times \mathcal{N}} \lambda(s,t)\, \mathbb{E}[C_{\mathcal{S}}(s,t)]$$

subject to $|\mathcal{S}| \le \beta$.

In other words, we wish to select $\mathcal{S}$ so that the cost of greedy forwarding is
minimized. Note that, since $\mathcal{S}$ is a random variable, the free variable of the
above optimization problem is essentially the distribution of $\mathcal{S}$, given by (9).

4.3 Relationship Between SWND and CSTC 

In what follows, we try to give some intuition about how SWND and CSTC are related and 
why the upper bounds we obtain for these two problems are identical, without resorting to 
the technical details appearing in our proofs. 

Consider the following version of the SWND problem, in which we place three additional
restrictions on the selection of the shortcut edges. First, $|\mathcal{S}| = n$, i.e., we can
only select $n = |\mathcal{N}|$ shortcut edges. Second, for every $x \in \mathcal{N}$, there
exists exactly one directed edge $(x, y) \in \mathcal{S}$: each object has exactly one
out-going edge incident to it. Third, the object $y$ to which object $x$ connects is selected
independently at each $x$, according to a probability distribution $\ell_x(y)$. In other
words, for $\mathcal{N} = \{x_1, x_2, \ldots, x_n\}$, the joint distribution of shortcut
edges has the form:

$$\Pr(\mathcal{S} = \{(x_1,y_1), \ldots, (x_n,y_n)\}) = \prod_{i=1}^{n} \ell_{x_i}(y_i). \tag{10}$$

We call this version of the SWND problem the one edge per object version, and denote it
by 1-SWND. Note that, in 1-SWND, the free variables are the distributions $\ell_x$,
$x \in \mathcal{N}$, which are to be selected in order to minimize the average cost
$\bar{C}_{\mathcal{S}}$.
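
The product form (10) translates into a simple sampling procedure: each object draws the endpoint of its single shortcut independently from its own distribution. The sketch below is an illustration with hypothetical names (`sample_shortcuts`, `joint_probability`) and a toy family of distributions; it does not implement the specific edge distribution proposed in the paper.

```python
import random


def sample_shortcuts(ell, rng):
    """Draw one shortcut (x, y) per object x, independently, with y ~ ell[x]."""
    shortcuts = set()
    for x, ell_x in ell.items():
        endpoints, weights = zip(*ell_x.items())
        y = rng.choices(endpoints, weights=weights, k=1)[0]
        shortcuts.add((x, y))
    return shortcuts


def joint_probability(ell, edges):
    """Probability of an edge set with one edge per object, as in (10)."""
    p = 1.0
    for x, y in edges:
        p *= ell[x].get(y, 0.0)
    return p


# Toy per-object distributions ell_x over shortcut endpoints.
ell = {
    "a": {"b": 0.5, "c": 0.5},
    "b": {"c": 1.0},
    "c": {"a": 1.0},
}
S = sample_shortcuts(ell, random.Random(0))
print(S, joint_probability(ell, S))
```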
Figure 2: An illustration of the relationship between 1-SWND and CSTC. In CSTC, the
source $s$ samples objects independently from the same distribution until it locates
an object closer to the target $t$. In 1-SWND, the re-sampling is emulated by the
movement to new neighbors. Each neighbor "samples" a new object independently,
from a slightly perturbed distribution, until one closer to the target $t$ is
found.



Consider now the following selection policy for CSTC:

$$\Pr(\mathcal{F}(x_k) = w) = \ell_{x_k}(w), \qquad \text{for all } w \in \mathcal{N}.$$

In other words, the proposed object at $x_k$ is sampled according to the same distribution
as the shortcut edge in 1-SWND. This selection policy is memoryless as it does not depend
on the history $\mathcal{H}_k$ of objects presented to the oracle so far.

A parallel between these two problems can be drawn as follows. Suppose that the same 
source/target pair (s, t) is given in both problems. In content search, while starting from 
node $s$, the memoryless selection policy draws independent samples from distribution $\ell_s$
until an object closer to the target than $s$ is found.

In contrast, greedy forwarding in 1-SWND can be described as follows. Since shortcut
edges are generated independently, we can assume that they are generated while the message
is being forwarded. Then, greedy forwarding at the source object can be seen as sampling
an object from distribution $\ell_s$, namely, the one incident to its shortcut edge. If this
object is not closer to the target than $s$, the message is forwarded to a neighboring node
$s_1$ over a local edge of $s$. Node $s_1$ then samples independently a node from
distribution $\ell_{s_1}$ this time, namely the one incident to its shortcut edge.

Suppose that the distributions $\ell_x$ vary only slightly across neighboring nodes. Then,
forwarding over local edges corresponds to the independent resampling occurring in the
content search problem. Each move to a new neighbor samples a new object (the one
incident to its shortcut edge) independently of previous objects but from a slightly perturbed
distribution. This is repeated until an object closer to the target t is found, at which point
the message moves to a new neighborhood over the shortcut edge.

Effectively, re-sampling is "emulated" in 1-SWND by the movement to new neighbors.
This is, of course, an informal argument; we refer the interested reader to the proofs of
Theorems 2 and 3 for a rigorous statement of the relationship between the two
problems.

5. Main Results

We now present our main results with respect to SWND and CSTC. Our first result is 
negative: optimizing greedy forwarding is a hard problem. 

Theorem 1 SWND is NP-hard. 

The proof of this theorem can be found in Section 7.1. In short, the proof reduces
DominatingSet to the decision version of SWND. Interestingly, the reduction is to a
SWND instance in which (a) the metric space is a 2-dimensional grid, (b) the distance
metric is the Manhattan distance on the grid and (c) the local edges are given by (8)
with r = 1. Thus, SWND remains NP-hard even in the original setup considered by
Kleinberg (2000).

The NP-hardness of SWND suggests that this problem cannot be solved in its full gen- 
erality. Motivated by this, as well as its relationship to content search through comparisons, 
we consider below the restricted version 1-SWND. In particular, we provide a distribution 
of edges for 1-SWND for which an upper-bound of search cost exists. This upper-bound 
can be expressed in terms of the entropy and the doubling dimension of the target
distribution $\mu$. Through the relationship of 1-SWND with CSTC, we are able to obtain a greedy
content search strategy whose cost can also be bounded the same way. 

For a given demand $\lambda$, recall that $\mu$ is the marginal distribution of the demand $\lambda$ over
the target set $\mathcal{T}$, and that for $A \subseteq \mathcal{N}$, $\mu(A) = \sum_{x \in A} \mu(x)$. Then, for any two objects
$x, y \in \mathcal{N}$, we define the rank of object y w.r.t. object x as follows:

$$r_x(y) = \mu\big(B_x(d(x, y))\big) \tag{11}$$

where $B_x(r)$ is the closed ball with radius r centered at x.

Suppose now that shortcut edges are generated according to the joint distribution (10),
where the outgoing link from an object $x \in \mathcal{N}$ is selected according to the following
probability:

$$\ell_x(y) \propto \frac{\mu(y)}{r_x(y)} \tag{12}$$

for $y \in \operatorname{supp}(\mu)$, while for $y \notin \operatorname{supp}(\mu)$ we define $\ell_x(y)$ to be zero. Eq. (12) implies the
following appealing properties. For two objects y, z that have the same distance from x,
if $\mu(y) > \mu(z)$ then $\ell_x(y) > \ell_x(z)$, i.e., y has a higher probability of being connected to
x. When two objects y, z are equally likely to be targets, if $y \prec_x z$ then $\ell_x(y) > \ell_x(z)$.
The distribution (12) thus biases both towards objects close to x as well as towards objects
that are likely to be targets. Finally, if the metric space $(\mathcal{M}, d)$ is a k-dimensional grid and
the targets are uniformly distributed over $\mathcal{N}$ then $\ell_x(y) \propto (d(x, y))^{-k}$. This is the shortcut
distribution used by Kleinberg (2000); (12) is thus a generalization of this distribution to
heterogeneous targets as well as to more general metric spaces.
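As an illustration of (11) and (12), the following minimal Python sketch computes ranks and the resulting shortcut distribution on a toy instance. The five objects on a line, the target probabilities and the helper names `rank` and `shortcut_distribution` are assumptions made only for this example. The sketch also leaves y = x out of the support, which is an assumption in line with the normalization used in the proof of Lemma 11.

```python
def rank(mu, dist, x, y):
    """r_x(y): target mass of the closed ball centered at x with radius d(x, y)."""
    radius = dist(x, y)
    return sum(p for z, p in mu.items() if dist(x, z) <= radius)


def shortcut_distribution(mu, dist, x):
    """l_x(y) proportional to mu(y) / r_x(y) for y in supp(mu), y != x."""
    weights = {
        y: p / rank(mu, dist, x, y)
        for y, p in mu.items()
        if p > 0 and y != x
    }
    z_x = sum(weights.values())  # normalization factor Z_x
    return {y: w / z_x for y, w in weights.items()}


# Hypothetical toy instance: objects 0..4 on a line, non-uniform targets.
mu = {0: 0.1, 1: 0.4, 2: 0.1, 3: 0.3, 4: 0.1}
dist = lambda a, b: abs(a - b)

ell = shortcut_distribution(mu, dist, 2)
for y in sorted(ell):
    print(y, round(ell[y], 3))
```

Objects 1 and 3 are equally far from object 2, so the more popular object 1 receives the larger probability; objects 0 and 4 are penalized both by their distance and by their low popularity.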

Our next theorem, whose proof is in Section 7.2, relates the cost of greedy forwarding
under (12) to the entropy $H$, the max-entropy $H_{\max}$ and the doubling parameter $c$ of the
target distribution $\mu$.

Theorem 2 Given a demand $\lambda$, consider the set of shortcut edges $\mathcal{S}$ sampled according to
(10), where $\ell_x(y)$, $x, y \in \mathcal{N}$, are given by (12). Then

$$\bar{C}_{\mathcal{S}} \le 6 c^3(\mu) \cdot H(\mu) \cdot H_{\max}(\mu).$$

Note that the bound in Theorem 2 depends on $\lambda$ only through the target distribution $\mu$.
In particular, it holds for any source distribution $\nu$, and does not require that sources are
selected independently of the targets t. Moreover, if $\mathcal{N}$ is a k-dimensional grid and $\mu$ is the
uniform distribution over $\mathcal{N}$, the above bound becomes $O(\log^2 n)$, thus recovering the result
of Kleinberg (2000).



Exploiting an underlying relationship between 1-SWND and CSTC, we can obtain an
efficient selection policy for greedy content search. In particular:

Theorem 3 Given a demand $\lambda$, consider the memoryless selection policy
$\Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w) = \ell_{x_k}(w)$, where $\ell_x$ is given by (12). Then

$$\bar{C}_{\mathcal{F}} \le 6 c^3(\mu) \cdot H(\mu) \cdot H_{\max}(\mu).$$

The proof of this theorem is given in Section 7.3. Like Theorem 2, Theorem 3 characterises
the search cost in terms of the doubling constant, the entropy and the max-entropy of $\mu$.
This is very appealing, given (a) the relationship between $c(\mu)$ and the topology of the
target set and (b) the classic result regarding the entropy and accesses to a membership
oracle, as outlined in Section 3.

A question arising from Theorems 2 and 3 is how tight these bounds are. Intuitively,
we expect that the optimal shortcut set $\mathcal{S}$ and the optimal selection policy $\mathcal{F}$ depend both
on the entropy of the target distribution and on its doubling constant. Our next theorem,
whose proof is in Section 7.4, establishes that this is the case for $\mathcal{F}$.

Theorem 4 For any integers K and D, there exists a metric space $(\mathcal{M}, d)$ and a target
measure $\mu$ with entropy $H(\mu) = K \log(D)$ and doubling constant $c(\mu) = D$ such that the
average search cost of any selection policy $\mathcal{F}$ satisfies

$$\bar{C}_{\mathcal{F}} \ge H(\mu) \, \frac{c(\mu) - 1}{2 \log(c(\mu))}. \tag{13}$$

Hence, the bound in Theorem 3 is tight within a $c^2(\mu) \log(c(\mu)) H_{\max}$ factor.

6. Learning Algorithm 

Section 5 established bounds on the cost of greedy content search provided that the
distribution (12) is used to propose items to the oracle. Hence, if the embedding of $\mathcal{N}$ in $(\mathcal{M}, d)$
and target distribution $\mu$ are known, it is possible to perform greedy content search with
the performance guarantees provided by Theorem 3.

In this section, we turn our attention to how such bounds can be achieved if neither
the embedding in $(\mathcal{M}, d)$ nor the target distribution $\mu$ are a priori known. To this end, we
propose a novel adaptive algorithm that achieves the performance guarantees of Theorem 3
without access to the above information.

Our algorithm effectively learns the ranks $r_x(y)$ of objects and the target distribution
$\mu$ as time progresses. It does not require that distances between objects are at any point
disclosed; instead, we assume that it only has access to a comparison oracle, slightly stronger
than the one described in Section 4.2.

It is important to note that our algorithm is adaptive: though we prove its convergence
under a stationary regime, the algorithm can operate in a dynamic environment. For exam- 
ple, new objects can be added to the database while old ones can be removed. Moreover, 
the popularity of objects can change as time progresses. Provided that such changes happen 
infrequently, at a larger timescale compared to the timescale in which database queries are 
submitted, our algorithm will be able to adapt and converge to the desired behavior. 

6.1 Demand Model and Probabilistic Oracle 

We assume that time is slotted and that at each timeslot $\tau = 0, 1, \ldots$ a new query is
generated in the database. As before, we assume that the source and target of the new
query are selected according to a demand distribution $\lambda$ over $\mathcal{N} \times \mathcal{N}$. We again denote by
$\nu$, $\mu$ the (marginal) source and target distributions, respectively.

Our algorithm will require that the support of both the source and target distributions
is $\mathcal{N}$, and more precisely that

$$\lambda(x, y) > 0, \quad \text{for all } x, y \in \mathcal{N}. \tag{14}$$

The requirement that the target set $\mathcal{T} = \operatorname{supp}(\mu)$ is $\mathcal{N}$ is necessary to ensure learning; we
can only infer the relative order w.r.t. objects t for which questions of the form Oracle(x, y, t)
are submitted to the oracle. Moreover, it is natural in our model to assume that the source
distribution $\nu$ is at the discretion of our algorithm: we can choose which objects to propose
first to the user/oracle. In this sense, for a given target distribution $\mu$ s.t. $\operatorname{supp}(\mu) = \mathcal{N}$,
(14) can be enforced, e.g., by selecting source objects uniformly at random from $\mathcal{N}$ and
independently of the target.

We consider a slightly stronger oracle than the one described in Section 4.1. In
particular, we again assume that

$$\text{Oracle}(x, y, t) = \begin{cases} x & \text{if } x \prec_t y, \\ y & \text{if } x \succ_t y. \end{cases} \tag{15}$$

However, we further assume that if $x \sim_t y$, then Oracle(x, y, t) can return either of the two
possible outcomes with non-zero probability. This is stronger than the oracle in Section 4.1,
where we assumed that the outcome will be arbitrary. We should point out here that this
is still weaker than an oracle that correctly identifies $x \sim_t y$ (i.e., the human states that
these objects are at equal distance from t) as, given such an oracle, we can implement the
above probabilistic oracle by simply returning x or y with equal probability.
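For simulation purposes, the probabilistic oracle can be sketched as follows. This is only a stand-in: in the setting of the paper the oracle is a human who never discloses distances, whereas the hypothetical `make_probabilistic_oracle` helper below is given an explicit distance function `dist` (an assumption of this example) and breaks exact ties uniformly at random.

```python
import random


def make_probabilistic_oracle(dist, rng=None):
    """Return Oracle(x, y, t) that breaks exact ties uniformly at random."""
    rng = rng or random.Random()

    def oracle(x, y, t):
        dx, dy = dist(x, t), dist(y, t)
        if dx < dy:
            return x
        if dy < dx:
            return y
        return rng.choice((x, y))  # x ~_t y: each outcome has probability 1/2

    return oracle


# Hypothetical example on a line metric; objects 1 and 3 are tied w.r.t. target 2.
dist = lambda a, b: abs(a - b)
oracle = make_probabilistic_oracle(dist, random.Random(0))
tie_outcomes = {oracle(1, 3, 2) for _ in range(200)}
```

Any tie-breaking rule that returns each outcome with non-zero probability would satisfy the assumption stated above; the uniform choice is simply the most direct one.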

6.2 Data Structures 

For every object $x \in \mathcal{N}$, the database storing x also maintains the following associated data
structures. The first data structure is a counter keeping track of how often the object x
has been requested so far. The second data structure maintains an order of the objects in
$\mathcal{N}$; at any point in time, this total order is an "estimator" of $\preceq_x$, the order of objects with
respect to their distance from x. We describe each one of these two data structures in more
detail below.

Estimating the Target Distribution The first data structure associated with an object
x is an estimator of $\mu(x)$, i.e., the probability with which x is selected as a target. A simple
method for keeping track of this information is through a counter $C_x$. This counter $C_x$ is
initially set to zero and is incremented every time object x is the target. If $C_x(\tau)$ is the
counter at timeslot $\tau$, then

$$\hat{\mu}(x) = C_x(\tau) / \tau \tag{16}$$

is an unbiased estimator of $\mu(x)$. To avoid counting to infinity, a "moving average" (e.g.,
an exponentially weighted moving average) could be used instead.
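Both estimators can be sketched in a few lines of Python. The class names and the smoothing weight `alpha` are assumptions introduced for this illustration; the text above does not prescribe a particular moving-average scheme.

```python
from collections import Counter


class CounterEstimator:
    """mu_hat(x) = C_x(tau) / tau, as in (16)."""

    def __init__(self):
        self.counts = Counter()
        self.tau = 0

    def record(self, target):
        self.counts[target] += 1
        self.tau += 1

    def estimate(self, x):
        return self.counts[x] / self.tau if self.tau else 0.0


class EwmaEstimator:
    """Exponentially weighted alternative; alpha is an assumed smoothing weight."""

    def __init__(self, objects, alpha=0.05):
        objects = list(objects)
        self.alpha = alpha
        self.mu_hat = {x: 1.0 / len(objects) for x in objects}

    def record(self, target):
        for x in self.mu_hat:
            hit = 1.0 if x == target else 0.0
            self.mu_hat[x] = (1 - self.alpha) * self.mu_hat[x] + self.alpha * hit


counter, ewma = CounterEstimator(), EwmaEstimator("abc")
for t in "aabac":  # hypothetical sequence of observed targets
    counter.record(t)
    ewma.record(t)
```

The exponentially weighted update keeps the estimates summing to one and gradually forgets old queries, which is what allows the estimate to track slow changes in popularity.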

Maintaining a Partial Order The second data structure $O_x$ associated with each $x \in \mathcal{N}$
maintains a total order of objects in $\mathcal{N}$ w.r.t. their similarity to x. It supports an operation
called order() that returns a partition of objects in $\mathcal{N}$ along with a total order over this
partition. In particular, the output of $O_x$.order() consists of an ordered sequence of disjoint
sets $A_1, A_2, \ldots, A_j$, where $\bigcup_i A_i = \mathcal{N} \setminus \{x\}$. Intuitively, any two objects in a set $A_i$ are
considered to be at equal distance from x, while among two objects $u \in A_i$ and $v \in A_{i'}$ with
$i < i'$ the object u is assumed to be the closer to x.

Moreover, every time that the algorithm invokes Oracle(u, v, x), and learns, e.g., that
$u \preceq_x v$, the data structure $O_x$ should be updated to reflect this information. In particular,
if the algorithm has learned so far the order relationships

$$u_1 \preceq_x v_1, \quad u_2 \preceq_x v_2, \quad \ldots \tag{17}$$

$O_x$.order() should return the objects in $\mathcal{N}$ sorted in such a way that all relationships in (17)
are respected. In particular, object $u_1$ should appear before $v_1$, $u_2$ before $v_2$, and so forth.
To that effect, the data structure should also support an operation called $O_x$.add(u, v) that
adds the order relationship $u \preceq_x v$ to the constraints respected by the output of $O_x$.order().

A simple (but not the most efficient) way of implementing this data structure is to
represent order relationships through a directed acyclic graph. Initially, the graph's vertex
set is $\mathcal{N}$ and its edge set is empty. Every time an operation add(u, v) is executed, a directed
edge is added from vertex u to vertex v. If the addition of the new edge creates a cycle then all
nodes in the cycle are collapsed to a single node, keeping thus the graph acyclic. Note that
the creation of a cycle $u \to v \to \cdots \to w \to u$ implies that $u \sim_x v \sim_x \cdots \sim_x w$, i.e., all
these nodes are at equal distance from x.

Cycles can be detected by using depth-first search over the DAG (Cormen et al., 2001).
The sets $A_i$ returned by order() are the sets associated with each collapsed node, while
a total order among them that respects the constraints implied by the edges in the DAG
can be obtained either by depth-first search or by a topological sort (Cormen et al., 2001).
Hence, the add() and order() operations have a worst case cost of $\Theta(n + m)$, where m is
the total number of edges in the graph.
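A minimal sketch of such a structure is given below. It deviates from the description above in one respect, which is a choice made for this example: instead of collapsing cycles eagerly inside add(), it stores all edges and lets order() compute the strongly connected components with a two-pass depth-first search (Kosaraju's algorithm), so that each component plays the role of a collapsed node. The class name `OrderStructure` and its attributes are hypothetical.

```python
from collections import defaultdict

_DONE = object()  # sentinel marking an exhausted successor iterator


class OrderStructure:
    """Naive O_x: stores constraints 'u before v' and derives an ordered partition."""

    def __init__(self, objects):
        self.objects = list(objects)   # the objects of N other than x
        self.succ = defaultdict(set)   # u -> {v : constraint u <=_x v was added}
        self.pred = defaultdict(set)   # reversed edges, used in the second pass

    def add(self, u, v):
        if u != v:
            self.succ[u].add(v)
            self.pred[v].add(u)

    def order(self):
        # Pass 1: iterative depth-first search, recording vertices by finish time.
        seen, finished = set(), []
        for root in self.objects:
            if root in seen:
                continue
            seen.add(root)
            stack = [(root, iter(self.succ[root]))]
            while stack:
                node, successors = stack[-1]
                nxt = next((w for w in successors if w not in seen), _DONE)
                if nxt is _DONE:
                    finished.append(node)
                    stack.pop()
                else:
                    seen.add(nxt)
                    stack.append((nxt, iter(self.succ[nxt])))
        # Pass 2: follow reversed edges in decreasing finish time. Each tree is one
        # strongly connected component (a set of tied objects), and components are
        # discovered in an order compatible with every recorded constraint.
        assigned, partition = set(), []
        for root in reversed(finished):
            if root in assigned:
                continue
            assigned.add(root)
            block, todo = set(), [root]
            while todo:
                node = todo.pop()
                block.add(node)
                for w in self.pred[node]:
                    if w not in assigned:
                        assigned.add(w)
                        todo.append(w)
            partition.append(block)
        return partition


o_x = OrderStructure(["a", "b", "c", "d"])
o_x.add("a", "b")
o_x.add("b", "c")
o_x.add("c", "b")  # closes the cycle b -> c -> b, so b and c become tied
partition = o_x.order()
```

In this sketch add() costs constant time and order() costs time linear in the number of vertices and edges, so the per-operation cost stays within the bound stated above. Objects without any constraint end up in singleton sets at an arbitrary position.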



Several more efficient algorithms exist in the literature (see, for example, Haeupler et al.,
2008; Pearce and Kelly, 2003; Bender et al., 2009), where the best (in terms of performance)
is the one proposed by Bender et al. (2009), yielding a cost of $O(n)$ for order() and an aggregate cost
of at most $O(n^2 \log n)$ for any sequence of add operations. We stress here that any of these
more efficient implementations could be used for our purposes. We refer the reader interested
in such implementations to Haeupler et al. (2008); Pearce and Kelly (2003); Bender et al.
(2009) and, to avoid any ambiguity, we assume the above naive approach for the remainder
of this work.

6.3 Greedy Content Search 

Our learning algorithm implements greedy content search, as described in Section 4.1, in
the following manner. When a new query is submitted to the database, the algorithm first
selects a source s uniformly at random. It then performs greedy content search using a
memoryless selection policy $\hat{\mathcal{F}}$ with distribution $\hat{\ell}_x$, i.e.,

$$\Pr(\hat{\mathcal{F}}(\mathcal{H}_k, x_k) = w) = \hat{\ell}_{x_k}(w), \quad w \in \mathcal{N}. \tag{18}$$

Below, we discuss in detail how $\hat{\ell}_x$, $x \in \mathcal{N}$, are computed.

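When the current object $x_k$, $k = 0, 1, \ldots$, is equal to x, the algorithm invokes $O_x$.order()
and obtains an ordered partition $A_1, A_2, \ldots, A_j$ of items in $\mathcal{N} \setminus \{x\}$. We define

$$\hat{r}_x(w) = \sum_{j=1}^{i \,:\, w \in A_i} \hat{\mu}(A_j), \quad w \in \mathcal{N} \setminus \{x\}.$$

This can be seen as an "estimator" of the true rank $r_x$ given by (11). The distribution $\hat{\ell}_x$
is then computed as follows:

$$\hat{\ell}_x(w_i) = \frac{1 - \epsilon}{\hat{Z}_x} \, \frac{\hat{\mu}(w_i)}{\hat{r}_x(w_i)} + \frac{\epsilon}{n - 1}, \quad i = 1, \ldots, n - 1, \tag{19}$$

where $\hat{Z}_x = \sum_{w \in \mathcal{N} \setminus \{x\}} \hat{\mu}(w) / \hat{r}_x(w)$ is a normalization factor and $\epsilon > 0$ is a small constant.
An alternative view of (19) is that the object proposed is selected uniformly at random
with probability $\epsilon$, and proportionally to $\hat{\mu}(w_i) / \hat{r}_x(w_i)$ with probability $1 - \epsilon$. The use of
$\epsilon > 0$ guarantees that every search eventually finds the target t.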
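The computation of the proposal distribution (19) from an ordered partition can be sketched as follows. The function name `proposal_distribution`, its arguments, and the uniform fallback used when all estimates are still zero are assumptions made for this example.

```python
def proposal_distribution(partition, mu_hat, eps=0.05):
    """Mix the weights mu_hat(w) / r_hat(w) with a uniform component, as in (19).

    partition: ordered list of sets A_1, ..., A_j covering the objects other than x.
    mu_hat:    dict mapping each object to its estimated target probability.
    """
    weights, r_hat = {}, 0.0
    for block in partition:
        r_hat += sum(mu_hat[w] for w in block)  # estimated rank shared by the block
        for w in block:
            weights[w] = mu_hat[w] / r_hat if r_hat > 0 else 0.0
    z_hat = sum(weights.values())               # normalization factor Z_hat
    count = len(weights)                        # the n - 1 candidate objects
    if z_hat == 0:                              # assumed fallback: no estimates yet
        return {w: 1.0 / count for w in weights}
    return {
        w: (1 - eps) * weights[w] / z_hat + eps / count
        for w in weights
    }


# Hypothetical estimates: object 1 is believed closest to x, objects 2 and 3 are tied.
mu_hat = {1: 0.2, 2: 0.3, 3: 0.5}
probs = proposal_distribution([{1}, {2, 3}], mu_hat, eps=0.1)
```

Every object receives at least eps / (n - 1) probability mass, which is the property that guarantees termination of the search.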

Upon locating a target t, any access to the oracle in the history $\mathcal{H}_k$ can be used to
update $O_t$; in particular, a call Oracle(u, v, t) that returns u implies the constraint $u \preceq_t v$,
which should be added to the data structure through $O_t$.add(u, v). Note that this operation
can take place only at the end of the greedy content search; the outcomes of calls to the
oracle can be observed, but the target t is revealed only after it has been located.
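The deferred update can be sketched as a search loop that buffers oracle answers and replays them once the target is known. The sketch assumes that each step compares the current object with a single proposed candidate, which is one reading of greedy content search; the function `search_and_learn`, the `propose` callback and the `RecordingOrder` stub are hypothetical names introduced for this illustration.

```python
def search_and_learn(source, target, propose, oracle, order_structures):
    """Run one greedy search, then replay the oracle answers into O_target.

    propose(x) draws a candidate for the current object x; oracle(x, y, t) returns
    the object it deems closer to t. Both are assumed to be supplied by the caller.
    """
    current, answers, cost = source, [], 0
    while current != target:
        candidate = propose(current)
        cost += 1
        if candidate == current:
            continue
        winner = oracle(current, candidate, target)
        loser = candidate if winner == current else current
        answers.append((winner, loser))
        current = winner
    # Only now is the target known, so the buffered answers can be attributed to it.
    for winner, loser in answers:
        if winner != target:  # O_t orders the objects other than t itself
            order_structures[target].add(winner, loser)
    return cost


class RecordingOrder:
    """Stub standing in for O_t; it only records the constraints it receives."""

    def __init__(self):
        self.constraints = []

    def add(self, u, v):
        self.constraints.append((u, v))


# Hypothetical deterministic run on objects 0..4 placed on a line.
dist = lambda a, b: abs(a - b)
oracle = lambda x, y, t: x if dist(x, t) <= dist(y, t) else y
propose = lambda x: (x + 1) % 5
structures = {t: RecordingOrder() for t in range(5)}
cost = search_and_learn(0, 3, propose, oracle, structures)
```

In a full implementation `propose` would sample from the distribution in (19) and the stub would be replaced by the order data structure of Section 6.2.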

Our main result is that, as $\tau$ tends to infinity, the above algorithm achieves performance
guarantees arbitrarily close to the ones of Theorem 3. Let $\hat{\mathcal{F}}(\tau)$ be the selection policy
defined by (18) at timeslot $\tau$ and denote by

$$\bar{C}(\tau) = \sum_{(s,t) \in \mathcal{N} \times \mathcal{N}} \lambda(s, t) \, \mathbb{E}\big[C_{\hat{\mathcal{F}}(\tau)}(s, t)\big]$$

the expected search cost at timeslot $\tau$. Then the following theorem holds:

Theorem 5 Assume that for any two targets $u, v \in \mathcal{N}$, $\lambda(u, v) > 0$. Then

$$\limsup_{\tau \to \infty} \bar{C}(\tau) \le \frac{6 c^3(\mu) H(\mu) H_{\max}(\mu)}{1 - \epsilon},$$

where $c(\mu)$, $H(\mu)$ and $H_{\max}(\mu)$ are the doubling parameter, the entropy and the max entropy,
respectively, of the target distribution $\mu$.

The proof of this theorem can be found in Section 7.5.

7. Analysis

This section includes the proofs of our theorems. 

7.1 Proof of Theorem 1

We first prove that the randomized version of SWND is no harder than its deterministic
version. Define DetSWND to be the same as SWND with the additional restriction that
$\mathcal{S}$ is deterministic. For any random variable $\mathcal{S} \subseteq \mathcal{N} \times \mathcal{N}$ that satisfies $|\mathcal{S}| \le \beta$, there exists a
deterministic set $S^*$ s.t. $|S^*| \le \beta$ and $\bar{C}_{S^*} \le \bar{C}_{\mathcal{S}}$. In particular, this is true for

$$S^* = \arg\min_{S \subseteq \mathcal{N} \times \mathcal{N},\, |S| \le \beta} \bar{C}_S.$$

Thus, SWND is equivalent to DetSWND. In particular, any solution of DetSWND will
also be a solution of SWND. Moreover, given a solution $\mathcal{S}$ of SWND, any deterministic $S$
belonging to the support of $\mathcal{S}$ will be a solution of DetSWND.

We therefore turn our attention to DetSWND. Without loss of generality, we can
assume that the weights $\lambda(s, t)$ are arbitrary non-negative numbers, as dividing every weight
by $\sum_{s,t} \lambda(s, t)$ does not change the optimal solution. The decision problem corresponding
to DetSWND is as follows:

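DetSWND-D: Given an embedding of $\mathcal{N}$ into $(\mathcal{M}, d)$, a set of local edges $\mathcal{L}$, a
non-negative weight function $\lambda$, and two constants $\alpha > 0$ and $\beta > 0$, is there a
directed edge set $\mathcal{S}$ such that $|\mathcal{S}| \le \beta$ and $\sum_{(s,t) \in \mathcal{N} \times \mathcal{N}} \lambda(s, t) C_{\mathcal{S}}(s, t) \le \alpha$?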

Note that, given the set of shortcut edges $\mathcal{S}$, forwarding a message with greedy forwarding
from any s to t can take place in polynomial time. As a result, DetSWND-D is in NP.
We will prove it is also NP-hard by reducing the following NP-complete problem to it:

DominatingSet: Given a graph G(V, E) and a constant k, is there a set $A \subseteq V$
such that $|A| \le k$ and $\Gamma(A) \cup A = V$, where $\Gamma(A)$ is the neighborhood of A in G?
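The DominatingSet condition is easy to verify for a given candidate set, as the following sketch shows. The function name `is_dominating_set`, the adjacency-dictionary representation and the five-vertex path used as an example are assumptions of this illustration.

```python
def is_dominating_set(adjacency, candidate, k):
    """Check |A| <= k and Gamma(A) union A == V for an undirected graph."""
    chosen = set(candidate)
    covered = set(chosen)
    for v in chosen:
        covered |= adjacency[v]  # add the neighborhood of v
    return len(chosen) <= k and covered == set(adjacency)


# Hypothetical example: the path 1 - 2 - 3 - 4 - 5.
adjacency = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
dominated = is_dominating_set(adjacency, {2, 4}, 2)
```

Checking a candidate takes polynomial time; the hardness lies in finding a small enough set.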

Given an instance (G(V, E), k) of DominatingSet, we construct an instance of DetSWND-D
as follows. The set $\mathcal{N}$ in this instance will be embedded in a 2-dimensional grid, and the
distance metric d will be the Manhattan distance on the grid. In particular, let n = |V| be
the size of the graph G and, w.l.o.g., assume that V = {1, 2, . . . , n}. Let

$$\ell_0 = 6n + 3, \tag{20}$$
$$\ell_1 = n \ell_0 + 2 = 6n^2 + 3n + 2, \tag{21}$$
$$\ell_2 = \ell_1 + 3n + 1 = 6n^2 + 6n + 3, \tag{22}$$
$$\ell_3 = \ell_0 = 6n + 3. \tag{23}$$

We construct an $n_1 \times n_2$ grid, where $n_1 = (n - 1) \cdot \ell_0 + 1$ and $n_2 = \ell_1 + \ell_2 + \ell_3 + 1$. That is,
the total number of nodes in the grid is

$$N = [(n - 1) \cdot \ell_0 + 1] \cdot (\ell_1 + \ell_2 + \ell_3 + 1) = O(n^4).$$

Figure 3: A reduction of an instance of DominatingSet to an instance of DetSWND-D. Only
the nodes on the grid that have non-zero incoming or outgoing demands (weights) are
depicted. The dashed arrows depict $A_1$, the set of pairs that receive a weight $W_1$. The
solid arrows depict $A_2$, the set of pairs that receive weight $W_2$.
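The grid dimensions in (20)-(23) can be checked numerically with a short sketch. The function name `grid_parameters` and the choice n = 4 are assumptions made for this example.

```python
def grid_parameters(n):
    """Lengths (20)-(23) and grid dimensions for a graph with n vertices."""
    l0 = 6 * n + 3
    l1 = n * l0 + 2
    l2 = l1 + 3 * n + 1
    l3 = l0
    rows = (n - 1) * l0 + 1   # n_1
    cols = l1 + l2 + l3 + 1   # n_2
    return {"l0": l0, "l1": l1, "l2": l2, "l3": l3,
            "rows": rows, "cols": cols, "nodes": rows * cols}


params = grid_parameters(4)
```

Both grid dimensions grow quadratically in n, so the grid has polynomially many nodes and the construction remains polynomial.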



The object set $\mathcal{N}$ will be the set of nodes in the above grid, and the metric space will be
$(\mathbb{Z}^2, d)$ where d is the Manhattan distance on $\mathbb{Z}^2$. The set of local edges $\mathcal{L}$ is defined according to
(8) with r = 1, i.e., any two adjacent nodes in the grid are connected by an edge in $\mathcal{L}$.
Denote by $a_i$, $i = 1, \ldots, n$, the node on the first column of the grid that resides at row
$(i - 1)\ell_0 + 1$. Similarly, denote by $b_i$, $c_i$ and $d_i$ the nodes on the columns $(\ell_1 + 1)$, $(\ell_1 + \ell_2 + 1)$
and $(\ell_1 + \ell_2 + \ell_3 + 1)$ of the grid, respectively, that reside at the same row as $a_i$, $i = 1, \ldots, n$.
These nodes are depicted in Figure 3. We define the weight function $\lambda(i, j)$ over the pairs
of nodes in the grid as follows. The pairs of grid nodes that receive a non-zero weight are
the ones belonging to one of the following sets:

$$A_1 = \{(a_i, b_i) \mid i \in V\},$$
$$A_2 = \{(b_i, b_j) \mid (i, j) \in E\} \cup \{(c_i, d_j) \mid (i, j) \in E\} \cup \{(c_i, d_i) \mid i \in V\},$$
$$A_3 = \{(a_i, d_i) \mid i \in V\}.$$

The sets $A_1$ and $A_2$ are depicted in Fig. 3 with dashed and solid lines, respectively. Note
that $|A_1| = n$ as it contains one pair for each vertex in V, $|A_2| = 4|E| + n$ as it contains
four pairs for each edge in E and one pair for each vertex in V, and, finally, $|A_3| = n$. The
pairs in $A_1$ receive a weight equal to $W_1 = 1$, the pairs in $A_2$ receive a weight equal to
$W_2 = 3n + 1$ and the pairs in $A_3$ receive a weight equal to $W_3 = 1$.
For the bounds $\alpha$ and $\beta$ take

$$\alpha = 2W_1|A_1| + W_2|A_2| + 3|A_3|W_3 = (3n + 1)(4|E| + n) + 5n, \tag{24}$$
$$\beta = |A_2| + n + k = 4|E| + 2n + k. \tag{25}$$


The above construction can take place in polynomial time in n. Moreover, if the graph 
G has a dominating set of size no more than k, one can construct a deterministic set of 
shortcut edges S that satisfies the constraints of DetSWND-D. 

Lemma 6 If the instance of DominatingSet is a "yes" instance, then the constructed
instance of DetSWND-D is also a "yes" instance.

Figure 4: A "yes" instance of DominatingSet and the corresponding "yes" instance of
DetSWND-D. The graph on the left can be dominated by two nodes, 1 and 4 (A = {1, 4}). The
corresponding set $\mathcal{S}$ of shortcut contacts that satisfies the constraints of DetSWND-D
is depicted on the right.



Proof

To see this, suppose that there exists a dominating set A of the graph with size $|A| \le k$.
Then, for every $i \in V \setminus A$, there exists a $j \in A$ such that $i \in \Gamma(j)$, i.e., i is a neighbor of
j. We construct $\mathcal{S}$ as follows. For every $i \in A$, add the edges $(a_i, b_i)$ and $(b_i, c_i)$ in $\mathcal{S}$. For
every $i \in V \setminus A$, add an edge $(a_i, b_j)$ in $\mathcal{S}$, where j is such that $j \in A$ and $i \in \Gamma(j)$. For
every pair in $A_2$, add this edge in $\mathcal{S}$. The size of $\mathcal{S}$ is

$$|\mathcal{S}| = 2|A| + (|V| - |A|) + |A_2| = |A| + n + 4|E| + n \le 4|E| + 2n + k.$$

Moreover, the weighted forwarding distance is

$$\bar{C}_{\mathcal{S}} = \sum_{(i,j) \in A_1} W_1 C_{\mathcal{S}}(i, j) + \sum_{(i,j) \in A_2} W_2 C_{\mathcal{S}}(i, j) + \sum_{(i,j) \in A_3} W_3 C_{\mathcal{S}}(i, j).$$

We have

$$\sum_{(i,j) \in A_2} W_2 C_{\mathcal{S}}(i, j) = W_2 |A_2|,$$

as every pair in $A_2$ is connected by an edge in $\mathcal{S}$. Consider now a pair $(a_i, b_i) \in A_1$, $i \in V$.
There is exactly one edge in $\mathcal{S}$ departing from $a_i$, which has the form $(a_i, b_j)$, where
either j = i or j is a neighbor of i. The distance of the closest local neighbor of $a_i$ from $b_i$
is $\ell_1 - 1$. The distance of $b_j$ from $b_i$ is at most $n \cdot \ell_0$. As $\ell_1 - 1 = n\ell_0 + 2 - 1 > n\ell_0$, greedy
forwarding will follow $(a_i, b_j)$. If $b_j = b_i$, then $C_{\mathcal{S}}(a_i, b_i) = 1$. If $b_j \ne b_i$, as j is a neighbor
of i, $\mathcal{S}$ contains the edge $(b_j, b_i)$. Hence, if $b_j \ne b_i$, $C_{\mathcal{S}}(a_i, b_i) = 2$. As i was arbitrary, we
get that

$$\sum_{(i,j) \in A_1} W_1 C_{\mathcal{S}}(i, j) \le 2 W_1 n.$$

Next, consider a pair $(a_i, d_i) \in A_3$. For the same reasons as for the pair $(a_i, b_i)$, the
shortcut edge $(a_i, b_j)$ in $\mathcal{S}$ will be used by the greedy forwarding algorithm. In particular,
the distance of the closest local neighbor of $a_i$ from $d_i$ is $\ell_1 + \ell_2 + \ell_3 - 1$ and $d(b_j, d_i)$ is at
most $\ell_2 + \ell_3 + n \cdot \ell_0$. As $\ell_1 - 1 > n\ell_0$, greedy forwarding will follow $(a_i, b_j)$.

By the construction of $\mathcal{S}$, $b_j$ is such that $j \in A$. As a result, again by the construction
of $\mathcal{S}$, $(b_j, c_j) \in \mathcal{S}$. The closest local neighbor of $b_j$ to $d_i$ has $\ell_2 + \ell_3 + d(b_j, b_i) - 1$ Manhattan
distance from $d_i$. Any shortcut neighbor $b_k$ of $b_j$ has at least $\ell_2 + \ell_3$ Manhattan distance
from $d_i$. On the other hand, $c_j$ has $\ell_3 + d(b_j, b_i)$ Manhattan distance from $d_i$. As $\ell_2 > 1$
and $\ell_2 > n\ell_0 > d(b_j, b_i)$, the greedy forwarding algorithm will follow $(b_j, c_j)$. Finally, as
$A_2 \subseteq \mathcal{S}$, and j = i or j is a neighbor of i, the edge $(c_j, d_i)$ will be in $\mathcal{S}$. Hence, the greedy
forwarding algorithm will reach $d_i$ in exactly 3 steps. As $i \in V$ was arbitrary, we get that

$$\sum_{(i,j) \in A_3} W_3 C_{\mathcal{S}}(i, j) = 3 W_3 n.$$

Hence,

$$\bar{C}_{\mathcal{S}} \le 2 W_1 n + W_2 |A_2| + 3 W_3 n = \alpha$$

and, therefore, the instance of DetSWND-D is a "yes" instance. ■

To complete the proof, we show that a dominating set of size at most k exists if there exists
an $\mathcal{S}$ that satisfies the constraints in the constructed instance of DetSWND-D.

Lemma 7 If the constructed instance of DetSWND-D is a "yes" instance, then the
instance of DominatingSet is also a "yes" instance.

Proof Assume that there exists a set $\mathcal{S}$, with $|\mathcal{S}| \le \beta$, such that the augmented graph has
a weighted forwarding distance less than or equal to $\alpha$. Then

$$A_2 \subseteq \mathcal{S}. \tag{26}$$

To see this, suppose that $A_2 \not\subseteq \mathcal{S}$. Then, there is at least one pair of nodes (i, j) in $A_2$ with
$C_{\mathcal{S}}(i, j) \ge 2$. Therefore,

$$\bar{C}_{\mathcal{S}} \ge 1 \cdot W_1 |A_1| + [(|A_2| - 1) \cdot 1 + 2] \cdot W_2 + 1 \cdot W_3 |A_3| = (3n + 1)(4|E| + n) + 5n + 1 > \alpha,$$

a contradiction.

Essentially, by choosing $W_2$ to be large, we enforce that all "demands" in $A_2$ are satisfied
by a direct edge in $\mathcal{S}$. The next lemma shows a similar result for $A_1$. Using shortcut edges
to satisfy these "demands" is enforced by making the distance $\ell_1$ very large.

Lemma 8 For every $i \in V$, there exists at least one shortcut edge in $\mathcal{S}$ whose origin is in
the same row as $a_i$ and in a column to the left of $b_i$. Moreover, this edge is used during the
greedy forwarding of a message from $a_i$ to $b_i$.

Proof Suppose not. Then, there exists an $i \in V$ such that no shortcut edge has its origin
between $a_i$ and $b_i$, or such an edge exists but is not used by the greedy forwarding from $a_i$
to $b_i$ (e.g., because it points too far from $b_i$). Then, the greedy forwarding from $a_i$ to $b_i$ will
use only local edges and, hence, $C_{\mathcal{S}}(a_i, b_i) = \ell_1$. We thus have that

$$\bar{C}_{\mathcal{S}} \ge \ell_1 + 2n - 1 + W_2 |A_2| = 6n^2 + 5n + 1 + W_2 |A_2|.$$

On the other hand, by (24), $\alpha = 5n + W_2 |A_2|$, so $\bar{C}_{\mathcal{S}} > \alpha$, a contradiction. ■

Let $S_1$ be the set of all edges whose origin is between some $a_i$ and $b_i$, $i \in V$, and that are
used during forwarding from this $a_i$ to $b_i$. Note that Lemma 8 implies that $|S_1| \ge n$. The
target of any edge in $S_1$ must lie to the left of the $(2\ell_1 + 1)$-th column of the grid. This is
because the Manhattan distance of $a_i$ to $b_i$ is $\ell_1$, so its local neighbor closest to $b_i$ lies at $\ell_1 - 1$ steps
from $b_i$. Greedy forwarding is monotone, so the Manhattan distance from $b_i$ of any target
of an edge followed subsequently to route towards $b_i$ must be less than $\ell_1$.

Essentially, all edges in $S_1$ must point close enough to $b_i$, otherwise they would not be
used in greedy forwarding. This implies that, to forward the "demands" in $A_3$, an additional
set of shortcut edges needs to be used.

Lemma 9 For every $i \in V$, there exists at least one shortcut edge in $\mathcal{S}$ that is used when
forwarding a message from $a_i$ to $d_i$ that is neither in $S_1$ nor in $A_2$.

Proof Suppose not. We established above that the target of any edge in $S_1$ is to the left of
the $(2\ell_1 + 1)$-th column. Recall that $A_2 = \{(b_i, b_j) \mid (i, j) \in E\} \cup \{(c_i, d_j) \mid (i, j) \in E\} \cup \{(c_i, d_i) \mid
i \in V\}$. By the definition of $b_i$, $i \in V$, the targets of the edges in $\{(b_i, b_j) \mid (i, j) \in E\}$ lie on
the $(\ell_1 + 1)$-th column. Similarly, the origins of the edges in $\{(c_i, d_j) \mid (i, j) \in E\} \cup \{(c_i, d_i) \mid
i \in V\}$ lie on the $(\ell_1 + \ell_2 + 1)$-th column. As a result, if the lemma does not hold, there is
a demand in $A_3$, say $(a_i, d_i)$, that does not use any additional shortcut edges. This means
that the distance between the $(2\ell_1 + 1)$-th and the $(\ell_1 + \ell_2 + 1)$-th column is traversed by using
local edges. Hence, $C_{\mathcal{S}}(a_i, d_i) \ge \ell_2 - \ell_1 + 1$ as at least one additional step is needed to get
to the $(2\ell_1 + 1)$-th column from $a_i$. This implies that

$$\bar{C}_{\mathcal{S}} \ge 2n + W_2 |A_2| + \ell_2 - \ell_1 = W_2 |A_2| + 5n + 1 > \alpha,$$

a contradiction. ■

Let $S_3 = \mathcal{S} \setminus (S_1 \cup A_2)$. Lemma 9 implies that $S_3$ is non-empty, while (26) and Lemma 8,
along with the fact that $|\mathcal{S}| \le \beta = |A_2| + n + k$, imply that $|S_3| \le k$. The following lemma
states that some of these edges must have targets that are close enough to the destinations
$d_i$.

Lemma 10 For each $i \in V$, there exists an edge in $S_3$ whose target is within Manhattan
distance 3n + 1 of either $d_i$ or $c_j$, where $(c_j, d_i) \in A_2$. Moreover, this edge is used for
forwarding a message from $a_i$ to $d_i$ with greedy forwarding.

Proof Suppose not. Then there exists an i for which greedy forwarding from $a_i$ to $d_i$ does
not employ any edge fitting the description in the lemma. Then, the destination $d_i$ cannot
be reached by a shortcut edge in either $S_3$ or $S_1$ whose target is closer than 3n + 1 steps.
Thus, $d_i$ is reached in one of the two following ways: either 3n + 1 steps are required in
reaching it, through forwarding over local edges, or an edge $(c_j, d_i)$ in $A_2$ is used to reach
it. In the latter case, reaching $c_j$ also requires at least 3n + 1 steps of local forwarding, as
no edge in $A_2$ or $S_3$ has a target within 3n steps from it, and any edge in $S_1$ that may be
this close is not used (by the hypothesis). As a result, $C_{\mathcal{S}}(a_i, d_i) \ge 3n + 2$ as at least one
additional step is required in reaching the ball of radius 3n centered around $d_i$ or $c_j$ from
$a_i$. This gives $\bar{C}_{\mathcal{S}} \ge 5n + W_2 |A_2| + 1 > \alpha$, a contradiction. ■

When forwarding from $a_i$ to $d_i$, $i \in V$, there may be more than one edge in $S_3$ fitting
the description in Lemma 10. For each $i \in V$, consider the last of all these edges. Denote
the resulting subset by $S_3'$. By definition, $|S_3'| \le |S_3| \le k$. For each i, there exists exactly
one edge in $S_3'$ that is used to forward a message from $a_i$ to $d_i$. Moreover, recall that
$\ell_0 = \ell_3 = 6n + 3$. Therefore, the Manhattan distance between any two nodes in $\{c_1, \ldots, c_n\} \cup
\{d_1, \ldots, d_n\}$ is at least 2(3n + 1) + 1. As a result, the targets of the edges in $S_3'$ will be within distance
3n + 1 of exactly one of the nodes in the above set.

Let $A \subseteq V$ be the set of all vertices $i \in V$ such that the unique edge in $S_3'$ used in
forwarding from $a_i$ to $d_i$ has a target within distance 3n + 1 of either $c_i$ or $d_i$. Then
A is a dominating set of G, and $|A| \le k$. To see this, note first that $|A| \le k$ because
each target of an edge in $S_3'$ can be within distance 3n + 1 of only one of the nodes in
$\{c_1, \ldots, c_n\} \cup \{d_1, \ldots, d_n\}$, and there are at most k edges in $S_3'$.

To see that A dominates the graph G, suppose that $j \in V \setminus A$. Then, by Lemma 10,
the edge in $S_3'$ corresponding to j points within distance 3n + 1 of either $d_j$ or
a $c_i$ such that $(c_i, d_j) \in A_2$. By the construction of A, it cannot point in the proximity
of $d_j$, because then $j \in A$, a contradiction. Similarly, it cannot point in the proximity of
$c_j$, because then, again, $j \in A$, a contradiction. Therefore, it points in the proximity of
some $c_i$, where $i \ne j$ and $(c_i, d_j) \in A_2$. By the construction of A, $i \in A$. Moreover, by
the definition of $A_2$, $(c_i, d_j) \in A_2$ if and only if $(i, j) \in E$. Therefore, $j \in \Gamma(A)$. As j was
arbitrary, A is a dominating set of G. ■



7.2 Proof of Theorem 2

According to (12), the probability that object x links to y is given by $\ell_x(y) = \frac{1}{Z_x} \frac{\mu(y)}{r_x(y)}$, where
$Z_x = \sum_{y \in \mathcal{T}} \frac{\mu(y)}{r_x(y)}$ is a normalization factor bounded as follows.

Lemma 11 For any $x \in \mathcal{N}$, let $x^* \in \min_{\preceq_x} \mathcal{T}$ be any object in $\mathcal{T}$ among the closest targets
to x. Then $Z_x \le 1 + \ln(1/\mu(x^*)) \le 3 H_{\max}$.

Proof Sort the target set $\mathcal{T}$ from the closest to furthest object from x and index objects
in an increasing sequence $i = 1, \ldots, k$, so the objects at the same distance from x receive
the same index. Let $A_i$, $i = 1, \ldots, k$, be the set containing objects indexed by i, and let
$\mu_i = \mu(A_i)$ and $\mu_0 = \mu(x)$. Furthermore, let $Q_i = \sum_{j=0}^{i} \mu_j$. Then $Z_x = \sum_{i=1}^{k} \frac{\mu_i}{Q_i}$.
Define $f_x(r) : \mathbb{R}^+ \to \mathbb{R}$ as

$$f_x(r) = \frac{1}{r} - \mu(x).$$

Clearly, $f_x(1/Q_i) = \sum_{j=1}^{i} \mu_j$, for $i \in \{1, 2, \ldots, k\}$. This means that we can rewrite $Z_x$ as

$$Z_x = \sum_{i=1}^{k} \big(f_x(1/Q_i) - f_x(1/Q_{i-1})\big) / Q_i.$$

By reordering the terms involved in the sum above, we get

$$Z_x = f_x(1/Q_k)/Q_k + \sum_{i=1}^{k-1} f_x(1/Q_i) \left( \frac{1}{Q_i} - \frac{1}{Q_{i+1}} \right).$$

First note that $Q_k = 1$, and second that since $f_x(r)$ is a decreasing function,

$$Z_x \le 1 - \mu_0 + \int_{1/Q_k}^{1/Q_1} f_x(r)\,dr = 1 - \frac{\mu_0}{Q_1} + \ln \frac{1}{Q_1}.$$

This shows that if $\mu_0 = 0$ then $Z_x \le 1 + \ln \frac{1}{\mu_1}$ or otherwise $Z_x \le 1 + \ln \frac{1}{\mu_0}$. ■
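The bound of Lemma 11 can be checked numerically on a small instance. The sketch below follows the notation of the proof ($\mu_i$, $Q_i$) and returns both $Z_x$ and the intermediate bound $1 - \mu_0/Q_1 + \ln(1/Q_1)$; the function name `normalizer_and_bound` and the toy distribution are assumptions of this example, which also presumes that the probabilities sum to one and that at least one target differs from x.

```python
import math


def normalizer_and_bound(mu, dist, x):
    """Return (Z_x, 1 - mu_0/Q_1 + ln(1/Q_1)) for a distribution mu that sums to 1."""
    mass_at = {}                          # distance from x -> mu_i
    for y, p in mu.items():
        if y != x and p > 0:
            d = dist(x, y)
            mass_at[d] = mass_at.get(d, 0.0) + p
    mu_0 = mu.get(x, 0.0)
    q, q_1, z_x = mu_0, None, 0.0
    for d in sorted(mass_at):
        q += mass_at[d]                   # Q_i = mu_0 + mu_1 + ... + mu_i
        if q_1 is None:
            q_1 = q
        z_x += mass_at[d] / q             # adds mu_i / Q_i
    bound = 1 - mu_0 / q_1 + math.log(1 / q_1)
    return z_x, bound


# Hypothetical toy instance: objects 0..4 on a line, evaluated at x = 2.
mu = {0: 0.1, 1: 0.4, 2: 0.1, 3: 0.3, 4: 0.1}
z_x, bound = normalizer_and_bound(mu, lambda a, b: abs(a - b), 2)
```

Here $\mu_0 = 0.1$, $Q_1 = 0.8$ and $Q_2 = 1$, so $Z_x = 0.7/0.8 + 0.2/1 = 1.075$, which stays below the intermediate bound and therefore below $1 + \ln(1/\mu_0)$.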

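Given the set $\mathcal{S}$, recall that $C_{\mathcal{S}}(s, t)$ is the number of steps required by the greedy
forwarding to reach $t \in \mathcal{N}$ from $s \in \mathcal{N}$. We say that a message at object v is in phase j if

$$2^{j-1} \mu(t) \le r_t(v) \le 2^{j} \mu(t).$$

Notice that the number of different phases is at most $\log_2 (1/\mu(t))$. We can write $C_{\mathcal{S}}(s, t)$ as

$$C_{\mathcal{S}}(s, t) = X_1 + X_2 + \cdots + X_{\log_2 (1/\mu(t))}, \tag{27}$$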

where Xj are the hops occurring in phase j. Assume that j > 1, and let 

rt{v) 



I = <w eM : rt{w) < 
The probability that v links to an object in the set /, and hence moving to phase j — 1, is 

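Let $\mu_t(r) = \mu(B_t(r))$ and let $\rho > 0$ be the smallest radius such that $\mu_t(\rho) \ge r_t(v)/2$. Since we
assumed that $j > 1$, such a $\rho > 0$ exists. Clearly, for any $r < \rho$ we have $\mu_t(r) < r_t(v)/2$. In
particular,

\[ \mu_t(\rho/2) < \tfrac{1}{2} r_t(v). \tag{28} \]

On the other hand, since the doubling parameter is $c(\mu)$ we have

\[ \mu_t(\rho/2) \ge \frac{1}{c(\mu)} \mu_t(\rho) \ge \frac{1}{2c(\mu)} r_t(v). \tag{29} \]

Therefore, by combining (28) and (29) we obtain

\[ \frac{1}{2c(\mu)} r_t(v) \le \mu_t(\rho/2) < \frac{1}{2} r_t(v). \tag{30} \]

Let $I_\rho = B_t(\rho/2)$ be the set of objects within radius $\rho/2$ from $t$. Then $I_\rho \subseteq I$, so

\[ \sum_{w \in I} \ell_v(w) \ge \frac{1}{Z_v} \sum_{w \in I_\rho} \frac{\mu(w)}{r_v(w)}. \]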

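By the triangle inequality, for any $w \in I_\rho$ and $y$ such that $d(y,v) \le d(v,w)$ we have

\[ d(t,y) \le d(v,y) + d(v,t) \le d(v,w) + d(v,t) \le d(w,t) + 2d(v,t) \le \tfrac{5}{2} d(v,t), \]

where the last step uses $d(w,t) \le \rho/2 \le d(v,t)/2$.
This means that $r_v(w) \le \mu_t(\tfrac{5}{2} d(v,t))$, and consequently, applying the doubling property
twice, $r_v(w) \le c^2(\mu) r_t(v)$. Therefore,

\[ \sum_{w \in I} \ell_v(w) \ge \frac{1}{Z_v} \frac{\sum_{w \in I_\rho} \mu(w)}{c^2(\mu) r_t(v)} = \frac{1}{Z_v} \frac{\mu_t(\rho/2)}{c^2(\mu) r_t(v)}. \]

By (30), the probability of terminating phase $j$ is uniformly bounded by

\[ \sum_{w \in I} \ell_v(w) \ge \min_v \frac{1}{2c^3(\mu) Z_v} \overset{\text{Lem. 11}}{\ge} \frac{1}{6c^3(\mu) H_{\max}(\mu)}. \tag{31} \]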

As a result, the number of hops in phase $j$ is stochastically dominated by a geometric
random variable with the parameter given in (31). This is because (a) if the current object
does not have a shortcut edge which lies in the set /, by Property [H greedy forwarding 
sends the message to one of the neighbours that is closer to t and (b) shortcut edges are 
sampled independently across neighbours. Hence, given that t is the target object and s is 
the source object, 

\[ E[X_j \mid s,t] \le 6c^3(\mu) H_{\max}(\mu). \tag{32} \]

Suppose now that $j = 1$. By the triangle inequality, $B_v(d(v,t)) \subseteq B_t(2d(v,t))$ and $r_v(t) \le
c(\mu) r_t(v)$. Hence,

\[ \ell_v(t) \ge \frac{1}{Z_v} \frac{\mu(t)}{c(\mu) r_t(v)} \ge \frac{1}{2c(\mu) Z_v} \ge \frac{1}{6c(\mu) H_{\max}(\mu)}, \]

since object $v$ is in the first phase and thus $\mu(t) \le r_t(v) \le 2\mu(t)$. Consequently,

\[ E[X_1 \mid s,t] \le 6c(\mu) H_{\max}(\mu). \tag{33} \]

Combining (27), (32), (33) and using the linearity of expectation, we get

\[ E[C_S(s,t)] \le 6c^3(\mu) H_{\max}(\mu) \log\frac{1}{\mu(t)} \]

and, thus, $\bar{C}_S \le 6c^3(\mu) H_{\max}(\mu) H(\mu)$.
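The final averaging step can be spelled out as follows. This is a sketch under two assumptions taken from the setup of the paper rather than from this proof: the target distribution is the marginal of the demand, $\mu(t) = \sum_s \lambda(s,t)$, and $H(\mu) = \sum_t \mu(t) \log\frac{1}{\mu(t)}$ is its entropy. It also uses $c(\mu) \ge 1$, so that the bound for the first phase is dominated by the bound for the other phases.

```latex
\begin{align*}
\bar{C}_S
  &= \sum_{s,t} \lambda(s,t)\, E[C_S(s,t)] \\
  &\le \sum_{s,t} \lambda(s,t)\, 6c^3(\mu) H_{\max}(\mu) \log\frac{1}{\mu(t)} \\
  &= 6c^3(\mu) H_{\max}(\mu) \sum_{t} \mu(t) \log\frac{1}{\mu(t)}
   = 6c^3(\mu) H_{\max}(\mu) H(\mu).
\end{align*}
```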

7.3 Proof of Theorem [3] 

The idea of the proof is very similar to the previous one and follows the same path. Recall 
that the selection policy is memoryless and determined by

\[ \Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w) = \ell_{x_k}(w). \]

We assume that the desired object is $t$ and the content search starts from $s$. Since there
are no local edges, the only way that the greedy search moves from the current object $x_k$
is by proposing an object that is closer to $t$. Like in the SWND case, we are in particular
interested in bounding the probability that the rank of the proposed object is roughly half
the rank of the current object. This way we can compute how fast we make progress in our
search.

As the search moves from $s$ to $t$ we say that the search is in phase $j$ when the rank of
the current object $x_k$ is between $2^{j}\mu(t)$ and $2^{j+1}\mu(t)$. As stated earlier, the greedy search
algorithm keeps making comparisons until it finds another object closer to $t$. We can write
$C_{\mathcal{F}}(s,t)$ as

\[ C_{\mathcal{F}}(s,t) = X_1 + X_2 + \cdots + X_{\log\frac{1}{\mu(t)}}, \]

where $X_j$ denotes the number of comparisons done by the comparison oracle in phase $j$. Let
us consider a particular phase $j$ and denote by $I$ the set of objects whose ranks from $t$ are at
most $r_t(x_k)/2$. Note that phase $j$ will terminate if the comparison oracle proposes an object
from the set $I$. The probability that this happens is

\[ \sum_{w \in I} \Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w) = \sum_{w \in I} \ell_{x_k}(w). \]

Note that the sum on the right-hand side depends on the distribution of shortcut edges and
is independent of local edges. To bound this sum we can use (31). Hence, with probability
at least $1/(6c^3(\mu) H_{\max}(\mu))$, phase $j$ will terminate. In other words, using the above selection
policy, if the current object $x_k$ is in phase $j$, then with probability at least $1/(6c^3(\mu) H_{\max}(\mu))$ the
proposed object will be in phase $(j - 1)$. This defines a geometric random variable, which
implies that on average the number of queries needed to halve the rank is at
most $6c^3(\mu) H_{\max}(\mu)$, i.e., $E[X_j \mid s,t] \le 6c^3(\mu) H_{\max}(\mu)$. Taking the average over the demand $\lambda$, we
conclude that the average number of comparisons satisfies $\bar{C}_{\mathcal{F}} \le 6c^3(\mu) H_{\max}(\mu) H(\mu)$.
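The search process analysed above can be simulated on a toy instance. The following is a minimal sketch, not the authors' implementation: it assumes objects placed at distinct points of the real line (so the metric is the absolute difference), a demand vector `mu` with `mu[t] > 0` for the target, and a proposal distribution proportional to $\mu(w)/r_x(w)$ over $w \ne x$. All function names are hypothetical.

```python
import random

def rank(x, w, pos, mu):
    """r_x(w): demand mass of the closed ball around x with radius d(x, w)."""
    radius = abs(pos[x] - pos[w])
    return sum(m for y, m in enumerate(mu) if abs(pos[x] - pos[y]) <= radius)

def proposal_weights(x, pos, mu):
    """Unnormalised weights mu(w) / r_x(w) for every w != x (zero for x itself)."""
    weights = []
    for w in range(len(mu)):
        if w == x or mu[w] == 0.0:
            weights.append(0.0)
        else:
            weights.append(mu[w] / rank(x, w, pos, mu))
    return weights

def greedy_search(s, t, pos, mu, rng):
    """Return the number of oracle comparisons used to reach target t from s."""
    current, comparisons = s, 0
    while current != t:
        weights = proposal_weights(current, pos, mu)
        proposal = rng.choices(range(len(mu)), weights=weights)[0]
        comparisons += 1
        # Comparison oracle: keep whichever of the two objects is closer to t.
        if abs(pos[proposal] - pos[t]) < abs(pos[current] - pos[t]):
            current = proposal
    return comparisons

def average_cost(s, t, pos, mu, runs=2000, seed=0):
    """Monte Carlo estimate of the expected number of comparisons from s to t."""
    rng = random.Random(seed)
    return sum(greedy_search(s, t, pos, mu, rng) for _ in range(runs)) / runs
```

A call such as `average_cost(0, 7, list(range(8)), [1 / 8] * 8)` gives an empirical cost that can be compared against the bound for the same instance.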

7.4 Proof of Theorem [4]

Our proof amounts to constructing a metric space and a target distribution $\mu$ for which the
bound holds. Our construction will be as follows. For some integers $D, K$, the target set
$\mathcal{N}$ is taken as $\mathcal{N} = \{1, \ldots, D\}^K$. The distance $d(x,y)$ between two distinct elements $x, y$ of
$\mathcal{N}$ is defined as $d(x,y) = 2^m$, where

\[ m = \max \{ i \in \{1, \ldots, K\} : x(K - i) \ne y(K - i) \}. \]

We then have the following 

Lemma 12 Let $\mu$ be the uniform distribution over $\mathcal{N}$. Then (i) $c(\mu) = D$, and (ii) if the
target distribution is $\mu$, the optimal average search cost $C^*$ based on a comparison oracle
satisfies $C^* \ge K\frac{D-1}{2}$.

Before proving Lemma 12, we note that Thm. 4 immediately follows as a corollary.
Proof [of Lemma 12] Part (i): Let $x = (x(1), \ldots, x(K)) \in \mathcal{N}$, and fix $r > 0$. Assume first
that $r < 2$; then, the ball $B(x,r)$ contains only $x$, while the ball $B(x,2r)$ contains either
only $x$ if $r < 1$, or precisely those $y \in \mathcal{N}$ such that

\[ (y(1), \ldots, y(K-1)) = (x(1), \ldots, x(K-1)) \]

if $r \ge 1$. In the latter case $B(x,2r)$ contains precisely $D$ elements. Hence, for such $r < 2$,
and for the uniform measure on $\mathcal{N}$, the inequality

\[ \mu(B(x,2r)) \le D \mu(B(x,r)) \tag{34} \]

holds, and with equality if in addition $r \ge 1$.

Consider now the case where $r \ge 2$. Let the integer $m \ge 1$ be such that $r \in [2^m, 2^{m+1})$.
By definition of the metric $d$ on $\mathcal{N}$, the ball $B(x,r)$ consists of all $y \in \mathcal{N}$ such that

\[ (y(1), \ldots, y(K-m)) = (x(1), \ldots, x(K-m)), \]

and hence contains $D^{\min(K,m)}$ points. Similarly, the ball $B(x,2r)$ contains $D^{\min(K,m+1)}$
points. Hence (34) also holds when $r \ge 2$.
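Part (i) can be verified by brute force on small instances. The sketch below makes one assumption explicit: the distance is implemented as $2^m$ with $m$ equal to $K$ minus the length of the common prefix of $x$ and $y$, which is the reading that matches the description of the balls in the proof. The function names are hypothetical.

```python
from itertools import product

def tree_distance(x, y):
    """2**m, where m = K minus the length of the common prefix of x and y."""
    if x == y:
        return 0
    prefix = 0
    while x[prefix] == y[prefix]:
        prefix += 1
    return 2 ** (len(x) - prefix)

def ball_size(center, radius, space):
    """Number of points in the closed ball B(center, radius)."""
    return sum(1 for y in space if tree_distance(center, y) <= radius)

def doubling_constant(d_size, k):
    """Largest ratio |B(x, 2r)| / |B(x, r)| under the uniform measure."""
    space = list(product(range(1, d_size + 1), repeat=k))
    # Ball sizes only change at powers of two, so these radii cover every case.
    radii = [2.0 ** m for m in range(-1, k + 2)]
    return max(
        ball_size(x, 2 * r, space) / ball_size(x, r, space)
        for x in space
        for r in radii
    )
```

Since the measure is uniform, the ratio of ball masses equals the ratio of ball cardinalities, which is why counting points suffices here.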

Part (ii): We assume that the comparison oracle, in addition to returning one of the
two proposals that is closer to the target, also reveals the distance of the proposal it returns
to the target. We further assume that upon selection of the initial search candidate $x_0$, its
distance to the unknown target is also revealed. We now establish that the lower bound
on $C^*$ holds when this additional information is available; it holds a fortiori for our more
restricted comparison oracle.

We decompose the search procedure into phases, depending on the current distance to
the destination. Let $L_0$ be the integer such that the initial proposal $x_0$ is at distance $2^{L_0}$ of
the target $t$, i.e.

\[ (x_0(1), \ldots, x_0(K - L_0)) = (t(1), \ldots, t(K - L_0)) \quad \text{and} \quad x_0(K - L_0 + 1) \ne t(K - L_0 + 1). \]

No information on $t$ can be obtained by submitting proposals $x$ such that $d(x, x_0) \ne 2^{L_0}$.
Thus, to be useful, the next proposal $x$ must share its $(K - L_0)$ first components with $x_0$,
and differ from $x_0$ in its $(K - L_0 + 1)$-th entry. Now, keeping track of previous proposals
made for which the distance to $t$ remained equal to $2^{L_0}$, the best choice for the next proposal
consists in picking it again at distance $2^{L_0}$ from $x_0$, but choosing for its $(K - L_0 + 1)$-th
entry one that has not been proposed so far. It is easy to see that, with this strategy, the
number of additional proposals after $x_0$ needed to leave this phase is uniformly distributed
on $\{1, \ldots, D - 1\}$, the number of options for the $(K - L_0 + 1)$-th entry of the target.

A similar argument entails that the number of proposals made in each phase equals 1 plus
a uniform random variable on $\{1, \ldots, D - 1\}$. It remains to control the number of phases.
We argue that it admits a Binomial distribution, with parameters $(K, (D - 1)/D)$. Indeed,
as we make a proposal which takes us into a new phase, no information is available on the
next entries of the target, and for each such entry, the new proposal makes a correct guess
with probability $1/D$. This yields the announced Binomial distribution for the number of
phases (when it equals 0, the initial proposal $x_0$ coincided with the target).

Thus the optimal number of search steps $C$ verifies $C \ge \sum_{i=1}^{X} (1 + Y_i)$, where the $Y_i$ are
i.i.d., uniformly distributed on $\{1, \ldots, D - 1\}$, and independent of the random variable $X$,
which admits a Binomial distribution with parameters $(K, (D - 1)/D)$. Thus using Wald's
identity, we obtain that $E[C] \ge E[X] E[Y_1]$, which readily implies (ii). ■
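The arithmetic behind the last step can be made explicit. The sketch below computes $E[X] E[Y_1]$ exactly for $X \sim \mathrm{Binomial}(K, (D-1)/D)$ and $Y_1$ uniform on $\{1, \ldots, D-1\}$, which evaluates to $K(D-1)/2$, and also samples the sum $\sum_{i=1}^{X}(1 + Y_i)$ directly. It assumes $D \ge 2$; the function names are hypothetical.

```python
import random
from fractions import Fraction

def wald_lower_bound(d_size, k):
    """E[X] * E[Y_1], computed exactly with rational arithmetic."""
    expected_phases = Fraction(k * (d_size - 1), d_size)           # E[X]
    expected_extra = Fraction(sum(range(1, d_size)), d_size - 1)   # E[Y_1] = D/2
    return expected_phases * expected_extra

def simulate_steps(d_size, k, rng):
    """One sample of sum_{i=1}^{X} (1 + Y_i)."""
    # Each of the K entries starts a new phase with probability (D-1)/D.
    phases = sum(1 for _ in range(k) if rng.random() < (d_size - 1) / d_size)
    return sum(1 + rng.randint(1, d_size - 1) for _ in range(phases))

def empirical_mean(d_size, k, runs=20000, seed=0):
    """Monte Carlo estimate of the expected number of search steps."""
    rng = random.Random(seed)
    return sum(simulate_steps(d_size, k, rng) for _ in range(runs)) / runs
```

For example, $D = 4$ and $K = 3$ give $E[X] = 9/4$ and $E[Y_1] = 2$, hence a lower bound of $9/2$.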

Note that the lower bound in (ii) has been established for search strategies that utilize the 
entire search history. Hence, it is not restricted to memoryless search. 

7.5 Proof of Theorem [5] 

Let $\Delta_\mu = \sup_{x \in \mathcal{N}} |\hat{\mu}(x) - \mu(x)|$. Observe first that, by the weak law of large numbers, for
any $\delta > 0$,

\[ \lim_{\tau \to \infty} \Pr(\Delta_\mu > \delta) = 0, \tag{35} \]

i.e., $\hat{\mu}$ converges to $\mu$ in probability. The lemma below states that, for every $t \in \mathcal{N}$, the order
data structure $O_t$ will learn the correct order of any two objects $u, v$ in finite time.

Lemma 13 Consider $u, v, t \in \mathcal{N}$ such that $u \preceq_t v$. Then, the order data structure in $t$
invokes $O_t$.add$(u,v)$ after a finite time, with probability one.

Proof Recall that $O_t$.add$(u,v)$ is invoked if and only if a call Oracle$(u,v,t)$ takes place
and it returns $u$. If $u \prec_t v$ then Oracle$(u,v,t) = u$. If, on the other hand, $u \sim_t v$, then
Oracle$(u,v,t)$ returns $u$ with non-zero probability. It thus suffices to show that, for
large enough $\tau$, a call Oracle$(u,v,t)$ occurs at timeslot $\tau$ with non-zero probability. By the
hypothesis of Theorem 5, $\lambda(u,t) > 0$. By (19), given that the source is $u$, the probability
that $\mathcal{F}(u) = v$ conditioned on $\hat{\mu}$ is



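\[ \hat{\ell}_u(v) \ge \frac{\mu(v) - \Delta_\mu}{1 + (n-1)\Delta_\mu} \cdot \frac{1 - \epsilon}{n-1} + \frac{\epsilon}{n-1} \ge \frac{\mu(v) - \Delta_\mu}{(1 + (n-1)\Delta_\mu)(n-1)} \]

as $Z_v \le n - 1$ and $|\hat{\mu}(x) - \mu(x)| \le \Delta_\mu$ for every $x \in \mathcal{N}$. Thus, for any $\delta > 0$, the probability
that a call Oracle$(u,v,t)$ takes place at a given timeslot is lower-bounded by

\[ \lambda(u,t) \Pr(\mathcal{F}(u) = v) \ge \lambda(u,t) \frac{\mu(v) - \delta}{(1 + (n-1)\delta)(n-1)} \Pr(\Delta_\mu \le \delta). \]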

By taking $\delta > 0$ smaller than $\mu(v)$, we have by (35) that there exists a $\tau^*$ s.t. for all $\tau > \tau^*$
the probability that Oracle$(u,v,t)$ takes place at timeslot $\tau$ is bounded away from zero, and
the lemma follows. ■

Thus if $t$ is a target then, after a finite time, for any two $u, v \in \mathcal{N}$ the ordered partition
$A_1, \ldots, A_j$ returned by $O_t$.order() will respect the relationship between $u, v$. In particular
for $u \in A_i$, $v \in A_{i'}$, if $u \sim_t v$ then $i = i'$, while if $u \prec_t v$ then $i < i'$. As a result, the
estimated rank of an object $u \in A_i$ w.r.t. $t$ will satisfy

\[ \hat{r}_t(u) = \sum_{x \in \mathcal{T} : x \preceq_t u} \hat{\mu}(x) + \sum_{x \in \mathcal{N} \setminus \mathcal{T} : x \in A_{i'},\, i' \le i} \hat{\mu}(x) = r_t(u) + O(\Delta_\mu), \]

i.e. the estimated rank will be close to the true rank, provided that $\Delta_\mu$ is small. Moreover,
as in Lemma 11 it can be shown that

\[ \hat{Z}_v \le 1 + \log\frac{1}{\hat{\mu}(v)} = 1 + \log\frac{1}{\mu(v) + O(\Delta_\mu)} \]

for $v \in \mathcal{N}$. From these, for $\Delta_\mu$ small enough, we have that for $u, v \in \mathcal{N}$,

\[ \hat{\ell}_u(v) = [\ell_u(v) + O(\Delta_\mu)](1 - \epsilon) + \frac{\epsilon}{n-1}. \]
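The role of $\epsilon$ is to mix the learned proposal distribution with the uniform one over the other $n - 1$ objects, so that every object is proposed with probability at least $\epsilon/(n-1)$. A minimal sketch of this mixing step follows; the function name `mixed_policy` is hypothetical and the input is assumed to be a probability vector over the $n - 1$ candidate objects.

```python
def mixed_policy(base, epsilon):
    """Mix a proposal distribution with the uniform one.

    base: probabilities over the n - 1 objects other than the current one.
    With probability epsilon a candidate is drawn uniformly at random,
    otherwise it is drawn from `base`.
    """
    n_minus_1 = len(base)
    return [(1 - epsilon) * p + epsilon / n_minus_1 for p in base]
```

For example, `mixed_policy([1.0, 0.0, 0.0], 0.3)` still concentrates on the first object, but leaves probability `0.1` on each of the other two.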



Following the same steps as the proof of Theorem 2 we can show that, given that $\Delta_\mu \le \delta$,
the expected search cost is upper bounded by $\frac{6c^3 H H_{\max}}{1 - \epsilon} + O(\delta)$. This gives us that

\[ \bar{C}(\tau) \le \left[ \frac{6c^3 H H_{\max}}{1 - \epsilon} + O(\delta) \right] \Pr(\Delta_\mu \le \delta) + \frac{n-1}{\epsilon} \Pr(\Delta_\mu > \delta), \]

where the second part follows from the fact that, by using the uniform distribution with
probability $\epsilon$, we ensure that the cost is stochastically upper-bounded by a geometric
r.v. with parameter $\frac{\epsilon}{n-1}$. Thus, by (35),

\[ \limsup_{\tau \to \infty} \bar{C}(\tau) \le \frac{6c^3 H H_{\max}}{1 - \epsilon} + O(\delta). \]

As this is true for all small enough $\delta$, the theorem follows.

8. Extensions 

In this section we discuss two possible extensions to the problem of content search through 
comparisons. The first one is about empowering the comparison oracle, namely, assuming 
that one has access to a stronger oracle which is able to return the most similar object to 
the target among a set of objects. If we choose the size of the set to be equal to two, we 
are back to our previous framework. The second one is about content search when we lift 
the assumption that objects are embedded in a metric space. 

8.1 Content Search Beyond Comparison Oracle 

A proximity oracle is an oracle that, given a set $A$ of size at most $p$ and a target $t$, returns
the closest object to $t$. More formally,

\[ \mathrm{Oracle}(A, t) = x, \quad \text{if } x \in A \text{ and } x \preceq_t y \;\; \forall y \in A. \tag{36} \]

Note that the comparison oracle is a special case of the proximity oracle where $|A| = 2$.
Moreover, by accessing the comparison oracle $|A| - 1$ times, one can implement the proximity oracle.
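One way to build a proximity oracle from pairwise comparisons is a linear scan that keeps the current winner; for a set of $m$ objects this uses $m - 1$ comparisons. The sketch below is illustrative only: `compare(x, y, t)` stands for any comparison oracle returning whichever of `x`, `y` is closer to `t`, and `CountingOracle` is a hypothetical numeric stand-in used to count calls.

```python
def proximity_oracle(candidates, target, compare):
    """Return the candidate closest to target using len(candidates) - 1 comparisons."""
    best = candidates[0]
    for obj in candidates[1:]:
        # Keep the winner of each pairwise comparison.
        best = compare(best, obj, target)
    return best

class CountingOracle:
    """Comparison oracle on numbers that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, y, target):
        self.calls += 1
        return x if abs(x - target) <= abs(y - target) else y
```

With `oracle = CountingOracle()`, the call `proximity_oracle([9, 4, 7, 1], 6, oracle)` returns `7` after three comparisons.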

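Theorem 14 Given a demand $\lambda$, consider the memoryless and independent selection policy

\[ \Pr(\mathcal{F}(\mathcal{H}_k, x_k) = (w_1, w_2, \ldots, w_p)) = \prod_{i=1}^{p} \ell_{x_k}(w_i), \]

where $\ell_{x_k}(w_i)$ is given by (12). Then the cost of greedy content search is bounded as follows:

\[ \bar{C}_{\mathcal{F}} \le \frac{6c^3(\mu) H_{\max}(\mu) H(\mu)}{p}. \]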

Proof We assume that the target is object $t$ and the content search starts from $s$. The
only way that the greedy search moves from the current object $x_k$ is by proposing a set $A$
that contains an object closer to $t$. Like in Section 3, we are in particular interested in bounding
the probability that the rank of the proposed object is roughly half the rank of the current
object. This way we can compute how fast we make progress in our search.

As the search moves from $s$ to $t$ we say that the search is in phase $j$ when the rank of
the current object $x_k$ is between $2^{j}\mu(t)$ and $2^{j+1}\mu(t)$. We can write $C_{\mathcal{F}}(s,t)$ as

\[ C_{\mathcal{F}}(s,t) = X_1 + X_2 + \cdots + X_{\log\frac{1}{\mu(t)}}, \]
where $X_j$ denotes the number of queries made to the oracle in phase $j$. Let
us consider a particular phase $j$ and denote by $I$ the set of objects whose ranks from $t$ are
at most $r_t(x_k)/2$. Moreover, let the set proposed by the selection policy be $\mathcal{F}(\mathcal{H}_k, x_k) =
(w_1, w_2, \ldots, w_p)$. Note that phase $j$ will terminate if one of the objects $(w_1, w_2, \ldots, w_p)$ is
from the set $I$. We denote by $F_i$, $1 \le i \le p$, the event that $w_i \in I$. Since the $F_i$'s are independent,
the probability that phase $j$ terminates is

\[ \Pr(F_1 \cup F_2 \cup \cdots \cup F_p) = 1 - \prod_{i=1}^{p} (1 - \Pr(F_i)). \]

To bound the last expression we can use (31). Hence, with probability at least $p/(6c^3(\mu) H_{\max}(\mu))$
phase $j$ will terminate. In other words, using the above selection policy, if the current object
$x_k$ is in phase $j$, with probability at least $p/(6c^3(\mu) H_{\max}(\mu))$ one of the proposed objects
will be in phase $(j - 1)$. This defines a geometric random variable, which implies
that on average the number of queries needed to halve the rank is at most $6c^3(\mu) H_{\max}(\mu)/p$,
i.e., $E[X_j \mid s,t] \le 6c^3(\mu) H_{\max}(\mu)/p$. Taking the average over the demand $\lambda$, we conclude that the
average number of comparisons satisfies $\bar{C}_{\mathcal{F}} \le 6c^3(\mu) H_{\max}(\mu) H(\mu)/p$. ■



8.2 Content Search Beyond Metric Spaces 

Similarity between objects is a well-defined relationship even if the objects are not embedded
in a metric space. More specifically, the notation $x \preceq_z y$ simply states that $x$ is more similar
to $z$ than $y$.

If the only information given about the underlying space is the similarity between objects,
then the most we can hope for is, for each object $x \in \mathcal{N}$, to sort the other objects $\mathcal{N} \setminus \{x\}$
according to their similarity to $x$.

Given the demand $\lambda$, the target set $\mathcal{T}$ is completely specified. For any $y \in \mathcal{T}$ let us
define the rank as follows:

\[ r_x(y) = |\{ z : z \in \mathcal{T},\, z \preceq_x y \}|. \]

We say that $y \in \mathcal{T}$ is the $k$-th closest object to $x$ if $r_x(y) = k$. First note that the rank is in
general asymmetric, i.e., $r_x(y) \ne r_y(x)$. Second, the triangle inequality is not satisfied in
general, i.e., $r_x(y) \not\le r_y(z) + r_z(x)$. However the approximate inequality as introduced in
Goyal et al. (2008) is always satisfied. More precisely, we say that the disorder factor $D(\mu)$
is the smallest $D$ such that we have the approximate triangle inequality

\[ r_x(y) \le D(r_z(y) + r_z(x)), \]
for all $x, y, z \in \mathcal{T}$. The factor $D(\mu)$ basically quantifies the non-homogeneity of the underlying
space when the only given information is the order of objects. Let the selection policy for
the non-metric space be defined as follows:

\[ \Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w) \propto \frac{1}{r_{x_k}(w)}, \tag{37} \]

for $w \in \mathcal{T}$. In case $w \notin \mathcal{T}$ we define $\Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w)$ to be zero.
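The quantities defined in this subsection can be computed directly from similarity information. The sketch below is only illustrative and rests on an assumption: similarity is supplied through a hypothetical dissimilarity function `diss(x, z)` (smaller means more similar to `x`), which need not be symmetric or satisfy the triangle inequality. It computes the rank, a brute-force disorder factor, and the selection probabilities proportional to the inverse rank.

```python
def rank(x, y, targets, diss):
    """r_x(y): number of targets z that are at least as similar to x as y is."""
    return sum(1 for z in targets if diss(x, z) <= diss(x, y))

def disorder_factor(targets, diss):
    """Smallest D with r_x(y) <= D * (r_z(y) + r_z(x)) for all x, y, z in T."""
    return max(
        rank(x, y, targets, diss)
        / (rank(z, y, targets, diss) + rank(z, x, targets, diss))
        for x in targets
        for y in targets
        for z in targets
    )

def selection_policy(x, targets, diss):
    """Probabilities over the targets, proportional to 1 / r_x(w)."""
    weights = [1.0 / rank(x, w, targets, diss) for w in targets]
    total = sum(weights)
    return [w / total for w in weights]
```

For targets `[0, 1, 3, 7]` on the line and current object `0`, the ranks are 1, 2, 3, 4, so the proposal probabilities are proportional to 1, 1/2, 1/3, 1/4.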

It is of high interest to see whether we can still navigate through the database when 
the characterization of the underlying space is unknown and only the similarity relationship 
between objects is provided. This is the main theme of the next theorem. 

Theorem 15 Consider the above selection policy. Then for any demand $\lambda$, the cost of
greedy content search is bounded as

\[ \bar{C}_{\mathcal{F}} \le 7 D(\mu) \log^2 |\mathcal{T}|. \]

The proof of this theorem is given below. Note again that the selection policy is
memoryless. Furthermore, it is universal in the sense that using this selection policy for any
kind of demand guarantees a search cost that only depends on the cardinality of the target set
and its disorder factor. For instance, this selection policy is useful when only the target set is
known a priori and the demand is not fully specified.

Proof The selection policy in the non-metric space scenario is given by (37), which implies
that only objects in the target set $\mathcal{T}$ are going to be proposed by the algorithm. Therefore,
except for the starting point $x_0 = s$, the algorithm navigates only through the target set.
The probability of proposing $w \in \mathcal{T}$ when $x_k$ is the current object of the search is given by

\[ \Pr(\mathcal{F}(\mathcal{H}_k, x_k) = w) = \frac{1}{Z_{x_k}} \frac{1}{r_{x_k}(w)}, \]

where $Z_{x_k} = \sum_{w \in \mathcal{T}} 1/r_{x_k}(w)$. Consequently,

\[ Z_{x_k} \le \sum_{j=1}^{|\mathcal{T}|} \frac{1}{j} = H_{|\mathcal{T}|}, \]

where $H_n$ is the $n$-th harmonic number. Hence, $Z_{x_k} \le 2 \log |\mathcal{T}|$. As the search moves from
$s$ to $t$ we say that the search is in phase $j$ when the rank of the current object $v \ne s$ with
respect to $t$ is $2^j \le r_t(v) < 2^{j+1}$. Clearly, there are only $\log |\mathcal{T}|$ different phases. The greedy
search algorithm keeps proposing to the oracle until it finds another object closer to $t$. We
can write $C_{\mathcal{F}}(s,t)$ as

\[ C_{\mathcal{F}}(s,t) = X_1 + X_2 + \cdots + X_{\log |\mathcal{T}|} + X_s, \]

where $X_s$ denotes the number of comparisons done by the oracle at the starting point until it
goes to an object $u \in \mathcal{T}$ such that $r_s(u) \le r_s(t)$. As before, $X_j$ ($j > 0$) is the number of
comparisons done by the oracle until it goes to the next phase.

We need to differentiate between the starting point of the process and the rest of it,
since unlike other objects proposed by the algorithm, the starting object $s$ may not be in
the target set. Let the rank of $t$ with respect to $s$ be $k$, i.e., $r_s(t) = k$. Then, the probability
that the greedy search algorithm proposes an object $w \in \mathcal{T}$ such that $r_s(w) \le r_s(t)$ is
$\sum_{j=1}^{k} \frac{1}{j Z_s} \ge \frac{1}{2 \log |\mathcal{T}|}$. As a result, $E[X_s \mid s,t] \le 2 \log |\mathcal{T}|$. This is the average number of
comparisons performed by the oracle until the greedy search algorithm escapes from the
starting object $s$.

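Let the current object $v \ne s$ be in phase $j$. We denote by

\[ I = \left\{ u : u \in \mathcal{T},\, r_t(u) \le \frac{r_t(v)}{2} \right\} \]

the set of objects whose rank from $t$ is at most $r_t(v)/2$. Clearly, $|I| = r_t(v)/2$. The
probability that the greedy search proposes an object $u \in I$ (and hence goes to the next
phase) is at least

\[ \sum_{u \in I} \frac{1}{Z_v} \frac{1}{r_v(u)} \ge \sum_{u \in I} \frac{1}{2 \log |\mathcal{T}| \, r_v(u)} \overset{(a)}{\ge} \sum_{u \in I} \frac{1}{2 \log |\mathcal{T}| \, D(\mu) (r_t(u) + r_t(v))}, \]

where in (a) we used the approximate triangle inequality. Since for $u \in I$ we have $r_t(u) \le
r_t(v)/2$, the probability of going from $v$ to the next phase is at least $1/(6D \log |\mathcal{T}|)$. Therefore,
$E[X_j \mid s,t] \le 6D \log |\mathcal{T}|$.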
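The constant in this step comes from the following short computation, written out here as a worked check. It uses only $r_t(u) \le r_t(v)/2$ for $u \in I$ and $|I| = r_t(v)/2$, as stated above.

```latex
\begin{align*}
\sum_{u \in I} \frac{1}{2 \log|\mathcal{T}| \, D(\mu)\,(r_t(u) + r_t(v))}
  &\ge |I| \cdot \frac{1}{2 \log|\mathcal{T}| \, D(\mu) \cdot \tfrac{3}{2} r_t(v)} \\
  &= \frac{r_t(v)}{2} \cdot \frac{1}{3 D(\mu) \log|\mathcal{T}| \, r_t(v)}
   = \frac{1}{6 D(\mu) \log|\mathcal{T}|}.
\end{align*}
```

The number of proposals in a phase is therefore dominated by a geometric random variable with this success probability, whose mean is $6 D(\mu) \log|\mathcal{T}|$.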

Using the linearity of expectation,

\[ E[C_{\mathcal{F}}(s,t)] \le 6D \log^2 |\mathcal{T}| + 2 \log |\mathcal{T}| \le 7D \log^2 |\mathcal{T}|. \]

The above conditional expectation does not depend on the demand $\lambda$. Hence, the expected
search cost for any demand is bounded as $E[C_{\mathcal{F}}] \le 7D \log^2 |\mathcal{T}|$. ■



9. Conclusions 

In this work, we initiated a study of CSTC and SWND under heterogeneous demands,
tying performance to the topology and the entropy of the target distribution. Our study
leaves several open problems, including improving upper and lower bounds for both CSTC
and SWND. Given the relationship between these two, and the NP-hardness of SWND,
characterizing the complexity of CSTC is also interesting. Also, rather than considering
restricted versions of SWND, as we did here, devising approximation algorithms for the
original problem is another possible direction.

Earlier work on comparison oracles eschewed metric spaces altogether, exploiting what
were referred to as disorder inequalities (Goyal et al., 2008; Lifshits and Zhang, 2009;
Tschopp and Diggavi, 2009). Applying these under heterogeneity is also a promising
research direction. Finally, trade-offs between space complexity and the cost of the learning
phase vs. the costs of answering database queries are investigated in the above works, and
the same trade-offs could be studied in the context of heterogeneity.